
data cloning. have you seen this?

2 messages · Paul Johnson, David Duffy

Hey, everybody:

Have you seen these papers that use "data cloning" for
hierarchical/mixed models? I'm pasting in two BibTeX cites.  The claims
are so fantastic that I can hardly believe them.  One can obtain ML
estimates and the information matrix from an ensemble of MCMC estimates
derived from clones of a data set.  I don't know how that differs
from averaging a lot of MCMC chains together; it certainly seems like
the same thing.
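As I read it, the difference from averaging chains is that the posterior is built on K copies of the data, so it concentrates around the MLE and its variance shrinks like 1/K. Here is a minimal sketch of that idea in Python, using a conjugate normal-mean model so the "MCMC" posterior is available in closed form; all of the numbers and the vague prior are my own illustrative assumptions, not anything from the papers:

```python
import random
import statistics

random.seed(0)
n, sigma = 50, 2.0
y = [random.gauss(1.5, sigma) for _ in range(n)]  # toy data, true mean 1.5
ybar = statistics.fmean(y)                        # the MLE for a normal mean

tau2 = 100.0  # vague N(0, tau2) prior on the mean
for K in (1, 10, 100):
    nk = n * K  # K "clones" of the data, stacked as if independent
    # Conjugate posterior for the mean given the cloned data:
    post_var = 1.0 / (1.0 / tau2 + nk / sigma ** 2)
    post_mean = post_var * (K * sum(y) / sigma ** 2)
    # Data-cloning claim: post_mean -> MLE, and K * post_var -> the
    # inverse Fisher information sigma^2 / n, as K grows.
    print(K, round(post_mean, 4), round(K * post_var, 4))
```

With real hierarchical models the posterior isn't available in closed form, which is where MCMC over the cloned data comes in; but the limiting behaviour shown here is the whole trick.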

I don't have an axe to grind here.  I'm asking you, the smartest folks
I know ( :) ), what you think?

(I found this by accident. The rjags package turned up with a reverse
dependency on the package "dclone", and I was curious to know what
dclone is for. The man pages in dclone point at the first Lele et al.
article below.)

I don't know how this addresses the problem that estimates of the
variance components can't be normally distributed, even
asymptotically, because they have that boundary at 0.  It seems
as though they assume that away, in the same way that many other
frequentists do.
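The boundary effect is easy to see in a toy simulation. Below is my own illustrative setup (a balanced one-way random-effects model with a small true between-group variance, using the truncated moment estimator as a simple stand-in for ML; none of this comes from the papers): a sizeable fraction of the estimates land exactly on 0, so the sampling distribution can't be normal.

```python
import random
import statistics

random.seed(1)
m, k = 10, 5           # 10 groups, 5 observations per group
sb2, sw2 = 0.05, 1.0   # small between-group variance, unit within-group

reps, zeros = 2000, 0
for _ in range(reps):
    groups = []
    for _ in range(m):
        b = random.gauss(0.0, sb2 ** 0.5)  # one random effect per group
        groups.append([b + random.gauss(0.0, sw2 ** 0.5) for _ in range(k)])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    msb = k * sum((mi - grand) ** 2 for mi in means) / (m - 1)
    msw = statistics.fmean(statistics.variance(g) for g in groups)
    # Moment estimator of the between-group variance, truncated at 0:
    est = max(0.0, (msb - msw) / k)
    zeros += (est == 0.0)

print(zeros / reps)    # the point mass sitting exactly on the boundary
```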

I also wonder about the small-medium sized sample performance of this
kind of ML approximation versus a genuine Bayesian approach.

@article{lele_data_2007,
        title = {Data cloning: easy maximum likelihood estimation for
complex ecological models using Bayesian Markov chain Monte Carlo
methods},
        volume = {10},
        issn = {1461-0248},
        shorttitle = {Data cloning},
        url = {http://www.ncbi.nlm.nih.gov/pubmed/17542934},
        doi = {10.1111/j.1461-0248.2007.01047.x},
        abstract = {We introduce a new statistical computing method,
called data cloning, to calculate maximum likelihood estimates and
their standard errors for complex ecological models. Although the
method uses the Bayesian framework and exploits the computational
simplicity of the Markov chain Monte Carlo ({MCMC}) algorithms, it
provides valid frequentist inferences such as the maximum likelihood
estimates and their standard errors. The inferences are completely
invariant to the choice of the prior distributions and therefore avoid
the inherent subjectivity of the Bayesian approach. The data cloning
method is easily implemented using standard {MCMC} software. Data
cloning is particularly useful for analysing ecological situations in
which hierarchical statistical models, such as state-space models and
mixed effects models, are appropriate. We illustrate the method by
fitting two nonlinear population dynamics models to data in the
presence of process and observation noise.},
        number = {7},
        journal = {Ecology Letters},
        author = {Subhash R. Lele and Brian Dennis and Frithjof Lutscher},
        month = jul,
        year = {2007},
        note = {{PMID:} 17542934},
        keywords = {Bayes Theorem, Computational Biology, Computer
Simulation, Ecology, Ecosystem, Likelihood Functions, Markov Chains,
Models, Biological, Monte Carlo Method, Population Dynamics},
        pages = {551--563}
}


@article{ponciano_hierarchical_2009,
        title = {Hierarchical models in ecology: confidence intervals,
hypothesis testing, and model selection using data cloning},
        volume = {90},
        issn = {0012-9658},
        shorttitle = {Hierarchical models in ecology},
        url = {http://www.esajournals.org/doi/abs/10.1890/08-0967.1},
        doi = {10.1890/08-0967.1},
        number = {2},
        journal = {Ecology},
        author = {José Miguel Ponciano and Mark L. Taper and Brian
Dennis and Subhash R. Lele},
        year = {2009},
        pages = {356--362}
}
On Wed, 21 Jul 2010, Paul Johnson wrote:

As I understand the first paper, at least, they are "just" using MCMC to 
fit frequentist ML models, with WinBUGS being used because it is 
convenient.  I have been using data cloning for MCMC GLMMs for a while, 
and it does seem to improve the point estimates for the fixed-effects 
regression coefficients and variance components.  I came upon it as a 
natural thing to do with a poorly mixing model on a smaller example 
dataset, and it was subsequently pointed out to me that it is used in 
the machine learning literature as well.  I also decided that it was not 
used by WinBUGS because their algorithms were better formulated and 
didn't need this crutch ;) -- it does slow things down.

Thanks for the references!

David Duffy.

PS If you are interested, I will post an example of its effects on random
effects variances for a GLMM.