missing data + explanatory variables
3 messages · Christophe Dutang, Emmanuel Charpentier
I have been fighting (part of) these questions, too. Below are some partial and temporary answers. On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
Dear list, I have two problems when I try to use mixed models. First, as far as I know, there are two main implementations of mixed models: lme4 and MCMCglmm. I am trying to model a binary response variable over a small period of time. The problem is that for some rows, the response is missing. I did not find an answer to this question in the mailing list archive:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q4/002940.html proposes MCMCglmm, but I checked the package and missing data are not handled;
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q3/002579.html offers no solution;
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q1/001794.html proposes the EM algorithm to solve the problem;
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q3/001188.html says missing data in covariates can be handled directly by R or SAS.
As I am a beginner with mixed models, I have spent a lot of time reading on this topic. In Edward Frees's book 'Longitudinal and Panel Data', estimation seems to be available for missing data. And there is an article promoting the use of mixed models for that feature: 'A Comparison of the General Linear Mixed Model and Repeated Measures ANOVA Using a Dataset with Multiple Missing Data Points'.
I'd be interested in references for this paper. Google led me to:

@article{krueger_tian_2004,
  title   = {A Comparison of the General Linear Mixed Model and Repeated
             Measures ANOVA Using a Dataset with Multiple Missing Data Points},
  author  = {Krueger, Charlene and Tian, Lili},
  journal = {Biol Res Nurs},
  volume  = {6},
  number  = {2},
  pages   = {151-157},
  month   = {Oct},
  year    = {2004},
  DOI     = {10.1177/1099800404267682}
}

and it does not seem to be available in any of my country's (France) academic libraries: the closest sources I can get are in Germany or the UK...
I'm a bit surprised by the abstract note (edited out above). Any model can (theoretically) be used to impute missing data; the theoretical work of Rubin and his followers has shown that a) this is desirable, since discarding incomplete observations ("complete-cases analysis") would in many cases lead to biased estimates; b) such imputation can be done "semi-automatically" in two special cases ("missing completely at random" and "missing at random" data; in other cases, one must *also* model the missing-data mechanism) by specifying a distribution for the variables with missing data; c) such imputation should be done at least a few times in order to estimate the excess variability that this estimation procedure adds to the estimators; and d) these multiple imputations should be combined in a way that incorporates this excess variability.
Mixed models explicitly model part of the inter-individual variation by splitting it into between-group and within-group variabilities. Imputation might therefore be more precise. But they do not, *by themselves*, allow for imputation.
Bayesian estimation of mixed-model parameters, which is becoming popular thanks to the BUGS language and recent textbooks such as Gelman & Hill (2007) (recommended reading, even if you are not bound to adhere to all its conclusions...), sort of facilitates this imputation process, but only in the sense that it is (relatively) easy to specify (at least in BUGS) a (reasonable) a priori distribution for any variable (though it is not always easy to obtain and assess numerical convergence...).
Other solutions have been proposed. At least three packages aim to provide a "reasonable" imputation model for missing data: Amelia II (which I did not fully explore, since it is a bit "closed"); mice (which is quite "open", but whose current version (2.3) has some problems with specifying specialized imputation functions); and mi (from Gelman & his gang, close to the ideas of the textbook mentioned above), which I have not yet fully explored; it seems interesting, but a bit awkward if you need to use specialized imputation functions.
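For what it is worth, the basic mice workflow (impute a few times, fit the model in each completed dataset, pool by Rubin's rules) looks like this -- a minimal sketch, where the data frame `dat` and the model `y ~ x1 + x2` are hypothetical stand-ins for your own:

```r
library(mice)

imp  <- mice(dat, m = 5, seed = 42)   # 5 imputed copies of dat
fits <- with(imp, lm(y ~ x1 + x2))    # refit the model in each completed copy
pooled <- pool(fits)                  # combine estimates, adding the
summary(pooled)                       #   between-imputation variance (Rubin)
```

The `pool()` step is where point d) above happens: the between-imputation variance is added to the average within-imputation variance.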
Various packages allow for estimation from a multiply-imputed dataset: mice and mi, of course, but also mitools and (reportedly) Zelig. The only one that tries to implement hypothesis testing (e.g. a test of the "significance" of a whole factor) is mice. But I am less and less convinced of the necessity of such a procedure, notwithstanding journal editors' wishes, whims and tantrums (see below).
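mitools follows the same fit-then-combine pattern for the estimation step -- a sketch, assuming `dat.list` is a (hypothetical) list of already-completed data frames produced by whatever imputation tool you used:

```r
library(mitools)

imps <- imputationList(dat.list)      # wrap the completed datasets
fits <- with(imps, lm(y ~ x1 + x2))   # one fit per imputed dataset
summary(MIcombine(fits))              # combine by Rubin's rules
```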
Bayesian estimation "automatically" incorporates the added uncertainty of missing data, since it uses a full probability model for imputing them (it can also incorporate relevant prior information if available, which can be extremely valuable but might invalidate your imputations from a "strict frequentist" point of view). But it can do so only if it is built to use and model all the data. Some specialized software, such as MCMCpack, is built on the assumption of a complete dataset and on a predefined shape for the a priori distributions of the variables (quite often a so-called 'uninformative' distribution). If you do have missing covariates, it will not model them but will exclude the relevant observations, leading to the same problems that plague "complete-cases analysis".
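To make "modeling all the data" concrete: in BUGS, giving a covariate with NAs its own distribution is enough for the sampler to impute those entries as part of the posterior. A hypothetical model fragment (all names illustrative), for a binary response with one partially observed covariate and a group-level intercept:

```
model {
  for (i in 1:N) {
    x[i] ~ dnorm(mu.x, tau.x)          # distribution for x: missing x[i]
                                       #   are sampled, not dropped
    logit(p[i]) <- a[group[i]] + b * x[i]
    y[i] ~ dbern(p[i])
  }
  for (j in 1:J) {
    a[j] ~ dnorm(a0, tau.a)            # group-level intercepts
  }
  a0 ~ dnorm(0, 0.001)
  b ~ dnorm(0, 0.001)
  mu.x ~ dnorm(0, 0.001)
  tau.x ~ dgamma(0.001, 0.001)
  tau.a ~ dgamma(0.001, 0.001)
}
```

Every posterior draw then carries its own imputed values of the missing x[i], so the extra uncertainty propagates into the estimates of a and b automatically.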
Similarly, lmer, as far as I can tell, does assume some shape for the group-level coefficients and for the distribution of the dependent variable given its predictors, but won't impute missing data. Ditto, as far as I can tell, for MCMCglmm.
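A quick way to see this for yourself -- sketched with a hypothetical data frame `dat` (binary response y, covariate x, grouping factor g): under the default na.action, glmer silently drops incomplete rows rather than imputing them.

```r
library(lme4)

## rows where y or x is NA are dropped (default na.action = na.omit);
## nothing is imputed
m <- glmer(y ~ x + (1 | g), data = dat, family = binomial)

## positive whenever dat contained incomplete rows
nrow(dat) - nrow(model.frame(m))
```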
So I do not understand why we can't have missing data?
Because there is neither a probability model for imputation nor an estimate-combining procedure in (g)lmer or in MCMCglmm.
Secondly, the presentation D. Bates gave at the Max Planck Institute in 2009 states that p-values are not available for mixed models because the distribution of the parameter estimators is not known. My question is: how can we know that an explanatory variable is significant? Is the only tool to fit another model without the variable and to use the anova() function?
Because 1) you did not specify *what* a "significant variable" is :-), and 2) the exact distributions of the possible "test statistics" (Wald? Score? Likelihood ratio? Ad-hoc relevant statistic?) are not known, at least in the general case. The simplified models proposed about 50 years ago for the (very) special case of *balanced* datasets resulting from *designed* experiments (implemented in the aov() function) do not hold for the ((much) more) general case that (g)lmer aims to implement.

Look in the R-help archives for a long discussion of this hypothesis-testing problem. Douglas Bates stated (rightly, IMNSHO) that reproducing "what SAS does" was *not*, in his eyes, a good enough reason to implement it, and explained (some of) his misgivings. See also his book (Pinheiro & Bates (2000)), which gives good examples of the problem (met with nlme, the predecessor to lme4).

The proposed solution is to use MCMC sampling from the distributions fitted by (g)lmer, and to use this as a basis for taking such decisions (which is what hypothesis testing aims to do). The fly in the ointment is that, as far as I can tell, the relevant functions are currently *broken* in the current "stable" and "development" versions of lme4 (and not yet written for some non-Gaussian cases). This has been discussed on this list.

But the crux of the matter is that the second, technical point might not be as important as the first: barrels of ink were spent discussing the *epistemological* status of "significance" in hypothesis testing, which became "standard operating procedure" probably for reasons having little relevance to sound epistemology. Nowadays we use electrons instead of ink, but the pendulum seems to be starting to swing in the other direction: confidence interval estimation is now often regarded as a better indication of the importance of your findings than a "p-value".
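In practice, the "fit with and without the variable" route you mention is indeed the usual workaround -- a sketch with hypothetical data (binary y, covariate x, grouping factor g); confint(), where available, gives the interval-estimation alternative:

```r
library(lme4)

m0 <- glmer(y ~ 1 + (1 | g), data = dat, family = binomial)
m1 <- glmer(y ~ x + (1 | g), data = dat, family = binomial)

## likelihood-ratio test for x as a whole (asymptotic chi-square,
## so treat the p-value with the caution discussed above)
anova(m0, m1)

## an interval estimate rather than a p-value
confint(m1, parm = "x", method = "Wald")
```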
A lot more could be written about the use and misuse of hypothesis testing, and even much more about the (possible) relevance of Bayesian analysis of multilevel models and its interpretation, but "this is another story" and I've probably been too long already... A look at the relevant literature should keep you amused, sometimes bored, but in any case busy for quite a bit of time :-). HTH, Emmanuel Charpentier, DDS, MSc
Thanks in advance Christophe
Some additions to what I wrote yesterday: I misunderstood the aim of the paper as explained in its abstract, and started too far along... On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
Dear list, I have two problems when I try to use mixed models. [...] So I do not understand why we can't have missing data?
Okay. I have been able to lay my eyes on this article. The point it makes is that the "mixed model" approach to analysing repeated measurements on the same subject allows you to use that subject's complete observations, discarding only the *observations* with missing data, whereas the old "repeated measures ANOVA" would lead you to ignore *all* observations made on a *subject* having *one* observation with missing data.

This is due to the fact that what the article calls the "repeated measures ANOVA" algorithms were (geometrically very smart) *manually* computable simplifications of a more general procedure involving, in the general case, manually intractable computation (multiple matrix inversions of not-so-small dimensions); these simplifications (and the resulting algorithms) were valid only for at least partially *balanced* datasets.

(g)lmer() (and its predecessor lme()) of course allows this use of data on incompletely documented *subjects*. This is so obvious to users of "modern" software (i.e. software using something other than transcriptions of manual computation algorithms) that it is no longer mentioned as an issue. The situation is similar to two-way fixed-effects ANOVA, where the "manual" algorithm I was taught ... too long a time ago ... *demanded* a balanced dataset, a requirement that ended even with BMDP (end of the 70s, IIRC). Similarly, nowhere that I am aware of do the authors of lm() mention that multiway models do not have to be balanced: that is taken for granted...

However, what the authors of that paper of yours do *not* mention is that analyzing this incomplete dataset by using all the complete *observations*, while better than using only complete *subjects*, might still lead to biased estimates, and that debiasing them should involve multiple imputations of the missing data in incomplete *observations*. That subject was well explored by Rubin (since the 80s, IIRC), and involves the specialized packages I mentioned yesterday ...
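The subject-versus-observation difference can be made concrete -- a sketch with a hypothetical long-format data frame `dat` (columns subject, time, y) where a few y values are NA:

```r
library(lme4)

## mixed model: only the individual NA *observations* are dropped
m <- lmer(y ~ time + (1 | subject), data = dat)

## old repeated-measures ANOVA workflow: balance requires complete
## *subjects*, so any subject with a single NA is discarded entirely
keep   <- names(which(tapply(!is.na(dat$y), dat$subject, all)))
dat.cc <- subset(dat, subject %in% keep)
a <- aov(y ~ time + Error(subject), data = dat.cc)
```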
or turning to Bayesian estimation, which is a horse of an entirely different color. Look up the 2nd edition of Rubin's book (ca 1998, IIRC) on the subject for (much) better explanations.

To summarize:

Repeated-measures ANOVA: manually tractable; needs balanced datasets, therefore forces you to ignore incomplete *subjects*. Obsolete (but smart and historically important).

Mixed-model ANOVA: manually intractable; accepts unbalanced datasets but does not allow for partial observations, therefore forces you to ignore incomplete *observations*. Modern solution, but does not account for possible bias due to missing data.

===== Your paper stops here (and I started here yesterday) =====

Multiple imputation + mixed-model ANOVA: allows you to use all available information, estimates the loss of information incurred by missing data and attempts to make up for it. Best frequentist (classical) solution; needs specialized software, and might require special modeling effort for the missing-data mechanism if "missing at random" is not "obviously" reasonable.

Bayesian modeling: requires serious effort to model the phenomenon of interest, its covariates and possibly the missing-data mechanism and a priori information; needs some awareness of the computational difficulties (not always solvable); current tools are not yet perfected. But it is (theoretically) the best possible solution, since it attempts to model the joint distribution of all the data, including their missingness. The answers it leads to (distributions, credible intervals, Bayes factors, probabilities) have intuitive meanings, quite different from "frequentist" confidence intervals and p-values, and might not (yet) be accepted in some circles insisting, for example, on hypothesis testing.
HTH, Emmanuel Charpentier PS : since "manual" algorithms are out of practical use since the end of the 70s and the inception of what was then called "personal computers", I'm a bit surprised that a paper published in 2004 still invokes that issue... Is your domain special (or especially conservative) ?