missing data + explanatory variables

3 messages · Christophe Dutang, Emmanuel Charpentier

#
I have been fighting (part of) these questions, too. Below are some
partial and temporary answers.

On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
I'd be interested in references for this paper. Google led me to :

@article{krueger_tian_2004, title={A Comparison of the General Linear
Mixed Model and Repeated Measures ANOVA Using a Dataset with Multiple
Missing Data Points}, volume={6}, DOI={10.1177/1099800404267682},
number={2}, journal={Biol Res Nurs}, author={Krueger, Charlene and Tian,
Lili}, year={2004}, month={Oct}, pages={151-157}}

and it does not seem to be available in any of my country's (France)
academic libraries: the closest sources I can find are in Germany or
the UK...

I'm a bit surprised by the abstract note (edited out above). Any model
can (theoretically) be used to impute missing data; the theoretical
work of Rubin and his followers has shown that a) this is desirable,
since discarding incomplete observations ("complete-cases analysis")
would in many cases lead to biased estimates, b) such imputation can be
done "semi-automatically" in two special cases, "missing completely at
random" and "missing at random" data (in other cases, one must *also*
model the missing-data mechanism), by specifying a distribution for the
variables with missing data, c) such imputation should be done at least
a few times in order to estimate the excess variability that this
estimation procedure adds to the estimators, and d) these multiple
imputations should then be combined in a way that incorporates this
excess variability.
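Point d) is Rubin's combining rule: average the completed-data
estimates, then add the between-imputation variance (inflated by a
factor 1 + 1/m) to the average within-imputation variance. A minimal
sketch, in Python rather than R, with invented numbers (rubin_pool is
an illustrative name, not a function from any package):

```python
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates by Rubin's combining rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # average within-imputation variance
    b = variance(estimates)      # between-imputation variance
    # Total variance: within + between, the latter inflated by 1 + 1/m.
    t = u_bar + (1 + 1 / m) * b
    return q_bar, t

# Five imputations of the same coefficient (invented numbers):
q, t = rubin_pool([1.2, 1.0, 1.1, 0.9, 1.3],
                  [0.04, 0.05, 0.04, 0.06, 0.05])
```

Here the pooled estimate is 1.1 with total variance 0.078, of which
0.03 is exactly the "excess variability" contributed by the imputation
step.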

Mixed models explicitly model part of the inter-individual variation,
by splitting it into between-group and within-group variabilities.
Therefore, imputation might be more precise. But they do not, *by
themselves*, allow for imputation.
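That between-/within-group split can be illustrated on toy numbers
(Python, invented data; with equal group sizes the two components add
up to the total population variance):

```python
from statistics import mean, pvariance

# Three subjects ("groups"), three measurements each (invented numbers):
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

# Within-group variability: average of the per-group variances.
within = mean(pvariance(g) for g in groups.values())

# Between-group variability: variance of the group means.
between = pvariance([mean(g) for g in groups.values()])

# With equal group sizes, total variance = within + between.
total = pvariance([x for g in groups.values() for x in g])
```

Here within is 2/3, between is 6, and their sum matches the total
variance of all nine observations.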

Bayesian estimation of mixed-model parameters, which is becoming
popular thanks to the BUGS language and recent textbooks such as
Gelman & Hill (2007) (recommended reading, even if you are not bound to
adhere to all its conclusions...), sort of facilitates this imputation
process, but only in the sense that it is (relatively) easy to specify
(at least in BUGS) a (reasonable) prior distribution for any variable
(but it is not always easy to obtain and assess numerical
convergence...).

Other solutions have been proposed. At least three packages aim to
provide a "reasonable" imputation model for missing data: Amelia II
(which I did not fully explore, since it is a bit "closed"), mice
(which is quite "open", but whose current version (2.3) has some
problems with specifying specialized imputation functions) and mi (from
Gelman & his gang, close to the ideas of the textbook mentioned above),
which I have not yet explored fully; it seems interesting but a bit
awkward if you need to use specialized imputation functions.

Various packages allow for estimation from a multiply-imputed dataset:
mice and mi, of course, but also mitools and (reportedly) Zelig. The
only one that tries to implement hypothesis testing (e.g. a test for
the "significance" of a whole factor) is mice. But I am less and less
convinced of the necessity of such a procedure, notwithstanding journal
editors' wishes, whims and tantrums (see below).

Bayesian estimation "automatically" incorporates the added uncertainty
of missing data, since it uses a full probability model for imputing
them (it can also incorporate relevant prior information if available,
which could be extremely valuable but might invalidate your imputations
from a "strict frequentist" point of view). But it can do so only if it
is built to use and model all the data. Some specialized software, such
as MCMCpack, is built on the assumption of a complete dataset and on
predefined shapes of the prior distributions of the variables (quite
often a so-called 'uninformative' distribution). If you do have missing
covariates, it will not model them and will exclude the relevant
observations, thus leading to the same problems that plague
"complete-cases analysis".

Similarly, lmer, as far as I can tell, does assume some shape for the
group-level coefficients and for the distribution of the dependent
variable given its predictors, but won't impute missing data. Ditto, as
far as I can tell, for MCMCglmm. This is because there is no
probability model for imputation (nor an estimate-combining procedure)
in (g)lmer or in MCMCglmm.
As for testing: because 1) you did not specify *what* a "significant
variable" is :-), and 2) the exact distributions of the possible "test
statistics" (Wald? Score? Likelihood ratio? Some ad-hoc relevant
statistic?) are not known, at least in the general case. The simplified
models that were proposed about 50 years ago for the (very) special
case of *balanced* datasets resulting from *designed* experiments
(implemented in the aov() function) do not hold for the ((much) more)
general case that (g)lmer aims to implement.

Look in the R-help archives for a long discussion of the problem of
such hypothesis testing. Douglas Bates stated (rightly, IMNSHO) that
reproducing "what SAS does" was *not*, in his eyes, a good enough
reason to implement it, and explained (some of) his misgivings. See
also his book (Pinheiro & Bates (2000)), which gives good examples of
the problem (met with nlme, the predecessor to lme4).

The proposed solution is to use MCMC sampling from the distributions
produced by (g)lmer, and to use this as a basis for making such
decisions (which is what hypothesis testing aims to do).

The fly in the ointment is that, as far as I can tell, the relevant
functions are currently *broken* in the current "stable" and
"development" versions of lme4 (and not yet written for some
non-Gaussian cases). This has been discussed on this list.

But the crux of the matter is that the second, technical point might
not be as important as the first: barrels of ink were spent discussing
the *epistemological* status of "significance" in hypothesis testing,
which became "standard operating procedure" probably for reasons having
little relevance to sound epistemology. Nowadays we use electrons, and
the pendulum seems to be starting to swing in the other direction:
confidence-interval estimation is now often regarded as a better
indication of the importance of your findings than a "p-value".

A lot more could be written about the use and misuse of hypothesis
testing, and even more about the (possible) relevance of Bayesian
analysis of multilevel models and its interpretation, but "this is
another story" and I've probably gone on too long already... A look at
the relevant literature should keep you amused, sometimes bored, but in
any case busy for quite a bit of time :-).

HTH,

					Emmanuel Charpentier, DDS, MSc
#
Some additions to what I wrote yesterday: I misunderstood the aim of
the paper as explained in its abstract, and started too far back...

On Wednesday 24 March 2010 at 12:09 +0100, christophe dutang wrote:
Okay. I have been able to lay my eyes on this article. The point it
makes is that the "mixed model" approach to analysing repeated
measurements on the same subject allows you to use all of a subject's
observations that have complete data, discarding only the
*observations* with missing data, whereas the old "repeated measures
ANOVA" would lead you to ignore *all* observations made on a *subject*
having even *one* observation with missing data.
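That bookkeeping difference can be made concrete with a toy count
(Python, invented data; None stands in for a missing measurement):

```python
# Three subjects measured at three time points; None marks a missing
# value.
data = {
    "s1": [10.0, 11.0, 12.0],
    "s2": [9.0, None, 11.0],
    "s3": [8.0, 9.0, None],
}

# Repeated-measures ANOVA: drop every *subject* with any missing value.
anova_n = sum(len(obs) for obs in data.values() if None not in obs)

# Mixed model: drop only the missing *observations* themselves.
mixed_n = sum(1 for obs in data.values() for x in obs
              if x is not None)
```

Out of nine scheduled measurements, the subject-wise rule keeps only
the three from s1, while the observation-wise rule keeps seven.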

This is due to the fact that what the article calls the "repeated
measures ANOVA" algorithms were (geometrically very smart) *manually*
computable simplifications of a more general procedure involving, in
the general case, manually intractable computation (multiple matrix
inversions of not-so-small dimensions); these simplifications (and the
resulting algorithms) were valid only for at least partially *balanced*
datasets.

(g)lmer() (and its predecessor lme()) of course allow for this use of
data on incompletely documented *subjects*. This is so obvious to users
of "modern" software (i.e. software that is something other than a
transcription of manual computation algorithms) that it is no longer
mentioned as an issue. The situation is similar to two-way
fixed-effects ANOVA, where the "manual" algorithm I was taught ... too
long a time ago *demanded* a balanced dataset, a requirement that ended
even with BMDP (end of the 70's, IIRC). Similarly, nowhere that I'm
aware of do the authors of lm() mention that multiway models do not
have to be balanced: that's taken for granted...

However, what the authors of that paper of yours do *not* mention is
that analyzing this incomplete dataset by using all the complete
*observations*, while better than using only complete *subjects*, might
still lead to biased estimates, and that debiasing them should involve
multiple imputations of the missing data in incomplete *observations*.
That subject was well explored by Rubin (since the 80's, IIRC), and
involves the specialized packages I mentioned yesterday ... or turning
to Bayesian estimation, which is a horse of an entirely different
color. Look up the 2nd edition of Rubin's book (ca 1998, IIRC) on the
subject for (much) better explanations.

To summarize:

Repeated-measures ANOVA: manually tractable, needs balanced datasets,
and therefore forces you to ignore incomplete *subjects*. Obsolete (but
smart and historically important).

Mixed-model ANOVA: manually intractable, accepts unbalanced datasets
but does not allow for partial observations, and therefore forces you
to ignore incomplete *observations*. The modern solution, but it does
not account for possible bias due to missing data.

===== Your paper stops here (and I started here yesterday) =====

Multiple imputation + mixed-model ANOVA: allows you to use all
available information, estimates the loss of information incurred by
missing data and attempts to make up for it. The best frequentist
(classical) solution; it needs specialized software, and might require
special modeling effort for the missing-data mechanism if "missing at
random" is not "obviously" reasonable.

Bayesian modeling: requires serious effort to model the phenomenon of
interest, its covariates, possibly the missing-data mechanism, and the
a priori information; needs some awareness of the computational
difficulties (not always solvable); and its current tools are not yet
perfected. But it is (theoretically) the best possible solution, since
it attempts to model the joint distribution of all the data, including
their missingness. The answers it leads to (distributions, credible
intervals, Bayes factors, probabilities) have intuitive meanings, quite
different from "frequentist" confidence intervals and p-values, and
might not (yet) be accepted in some circles insisting, for example, on
hypothesis testing.

HTH,

						Emmanuel Charpentier

PS: since "manual" algorithms have been out of practical use since the
end of the 70's and the inception of what were then called "personal
computers", I'm a bit surprised that a paper published in 2004 still
invokes that issue... Is your domain special (or especially
conservative)?