Best way to handle missing data?

10 messages · Mitchell Maltenfort, landon hurley, Bonnie Dixon +2 more

#
Dear list,

I am using nlme to create a repeated-measures (i.e., two-level) model.  There
is missing data in several of the predictor variables.  What is the best
way to handle this situation?  The variable with (by far) the most missing
data is the best predictor in the model, so I would not want to remove it.
I am also trying to avoid omitting the observations with missing data,
because that would require omitting almost 40% of the observations and
would result in a substantial loss of power.

A member of my dissertation committee who uses SAS recommended that I use
full information maximum likelihood estimation (FIML) (described here:
http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf),
which is the easiest way to handle missing data in SAS.  Is there an
equivalent procedure in R?

Alternatively, I have tried several approaches to multiple imputation.  For
example, I used the Amelia package, which appears to handle the clustered
structure of the data appropriately, to generate five imputed versions of
the data set, and then used lapply to run my model on each.  But I am not
sure how to combine the resulting five models into one final result.  I
will need a final result that enables me to report not just the fixed
effects of the model but also the random-effects variance components and,
ideally, the population distributions of the random intercept and slopes
and the correlations between them.

Many thanks for any suggestions on how to proceed.

Bonnie
#
Mice might be the package you need
On Thursday, February 26, 2015, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:

            

  
    
#
I actually did try mice also (method "2l.norm"), but it seemed that Amelia
was preferable for imputation.  Mice seems to only be able to impute one
variable, whereas Amelia can impute as many variables as have missing data
producing 100% complete data sets as output.

However, most of the missing data in the data set I am working with is in
just one variable, so I could consider using mice, and just imputing the
variable that has the most missing data, while omitting observations that
have missing data in any of the other variables.  But the pooled results
from mice only seem to include the fixed effects of the model, so this
still leaves me wondering how to report the random effects, which are very
important to my research question.

When using Amelia to impute, the packages Zelig and ZeligMultilevel can be
used to combine the results from each of the models.  But again, only the
fixed effects seem to be included in the output, so I am not sure how to
report on the random effects.

Bonnie

On Thu, Feb 26, 2015 at 8:33 PM, Mitchell Maltenfort <mmalten at gmail.com>
wrote:

  
  
#
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
On 02/26/2015 09:30 PM, Bonnie Dixon wrote:
If you are interested in having maximum likelihood methods, you can use
either ml or reml, specified with the method flag for the nlme command.
However, ml is the default method for estimating parameters for nlme,
and you shouldn't need to do anything at all, outside specify the model.
- From your email, it seems that you are saying that the number of
observations/groups is not reporting the number that you are expecting
there to be though. Is that correct? This is assuming you are content
with the multivariate normal assumption, and are not trying to analyse
discrete outcomes.
Mice will impute the entire dataset. Off hand, I believe the syntax
would look something like mice(data, m= , method= , maxit= ), where m is
the number of independent datasets being imputed (generally you want
25+), maxit being at least 10, and the method being a vector of
character indications of how you want to impute each of the variables,
in the same order that the appear if you use the command names(data). If
you specified 2l.norm, it should have attempted to impute all the
variables using that method, which may not have worked. What mice does
is impute each marginal variable, using the other variables to predict
the true value, done the number of times (random draws using Gibbs
sampling) within each imputation that is specified with the maxit flag,
for m times.

Again, nlme is by default using maximum likelihood though --you
shouldn't need to change anything, as long as you are content with the
MVN and missing at random assumptions for your data.

landon

- -- 
Violence is the last refuge of the incompetent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQIcBAEBCgAGBQJU8AJWAAoJEDeph/0fVJWsJ58P/R06GjLjjdaRTJPTT3/6d4xr
EkcQmW1+bH8NZkSBlUzYk/CVmZ/EGK71KIjcdSTzDusAyh9neyXvh5zQiPU287Tl
VRQlOtbLlgoW0rE+x0uFd6PLwsCQRkck2upSU4sCyEpq+/ZSkGUTuE2VsUVCu27y
z4Ecl9sw+s93IpJGj91b9PjdH8g8RysZR7CH/FCfvpzXrRalFTtC75oP8VXEdMWp
rYTqh2/sCds29x/qbS1oxrlWSN0/NuYeTgBE+uCYZ4QxTmQO8JmJA9Sn0k5kKbjU
l1RiZhd48vUj6BFpKCw6HDn1jBVeURXVPlUOBXCFDg13vJBhYdnZAR/nRGQe3dqG
leA/+Ajyyu+fHxlN7T73Nk7nYSM2YfVYJcBT+ALtqf2XWXaHti5rQMi0YaaEI3TN
tTzAEDTjYbt0WCJ4er+pXCcZIVBUoepFH708XFL8LNZ95E/qmsKTTydN+PPmjzIJ
OpGOjDx1Xk0Xc8rKGhAJ/hJbDd7bqmaqrkfa2ydxSd20IPlGMPlx3Fk+2K2l+JyF
qYI7Y3+qGd0YSOGacg+uwEGt6KSEvWsbrx2Vfreifi0p1H4koSySqccaCBvDhVKu
0BBPoG7ErZ0bTpDWQrAChtPAb2jYEbBLCtdKqKezNHFw5/tNEKQFAUvVSu0OByeY
4IG8phi2yApsZ4yEdt/v
=DiLG
-----END PGP SIGNATURE-----
#
mice will impute the complete dataset, it just needs to have an imputation
method setup for each variable. See the example given in the help for
mice.impute.2lonly.norm

Full information maximum likelihood estimation (FIML) (Note for Landon,
this is ML taking into account the missing data) is only feasible if you
can reformulate everything as a structural equation model and use software
that can cope with this. Otherwise working with the integrals is pretty
much impossible. If there is something in the model that is nonlinear it
probably isn't an option at all. One of the great things about multiple
imputation is that you get it running with say 20 imputations and then run
it overnight with 200 or more and it probably won't change but you will
know that you have enough imputations. So FIML doesn't have an advantage in
that respect.
On 27 February 2015 at 16:20, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:

            

  
    
#
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
On 02/27/2015 01:02 AM, Ken Beath wrote:
I'm not sure that's needed as a distinction. This quote from the 	r-help
mailing list [0]  addresses it:
So maybe a semantics difference. However, with respect to the handling
of the integral: if it's problematic, that should result in a
non-convergence problem, or different results reported when he reruns
the model, in terms of diagnostics.

[0]https://stat.ethz.ch/pipermail/r-help/2004-August/056723.html
- -- 
Violence is the last refuge of the incompetent.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQIcBAEBCgAGBQJU8A5AAAoJEDeph/0fVJWsbNUP/invP0QBC1qS0sWfKrnRVM09
kV1fv4Y8rVflFnS+znsbAPDJOK+5YnvITmfoVLMdwTAWaUEyugKZVGDydY+fTDfg
GxokxDpNAdGlfDBg+asw49VOFoTFtBKai0PWKyw4zHrAHYS9rzTqeO2CVq1Qlb8G
F7je9naYr+iwcEkIWQZ2JloBH8OPw80UueWqNjQ0totVRN8ehYgsu2+iyyudTQnH
Sl7LWkg6QnDYYVKrlV9ygd6z9yOymU9f5w52px1cUIY0mBoT12fYturEfyi/aIxF
+3nBjRCE14C2c9y6mW2Lab9AYpR8bbzsmTK6y7PXid6/VxcqkZlE6Qsj4bD4zvK3
lkIdFj8BR2LdzJNI1EdM8LREA82VPrkS5LFf/4ige0pSo6X3aVoInC2ohLKGSdr5
r66Nh3tLu1a6kPtPBNw7YAxzkzRd2CKy9OTvOpz5wRqlXNvzOoq2Is7Hpoeva0yB
3hvAAgmJUtq8ZbTEXLQiDl2w/qeO+8o5KRfm/2uutN8z29S768me/6bfnvLELw9w
y2R4vwOGdpp+3XBAfs8sF5bMGVvTEzZj/ILph5D7OFRJi/pfCbntnf2mAFrllvlt
KUh+Okd0bO5dC2gfLuu42J3jQnCTMez/ghrEVlXkRX9XMnMz3JB7r4pdgmUqXHYu
w9eXfCoXza9efwhgHF1q
=LMV6
-----END PGP SIGNATURE-----
#
<snip>
*>*missing data treatment is included in the estimation procedure
*>*(parameter estimates are derived from incomplete cases for only the
*>*variables present in the case, rather than simply discarding the
*>*cases), at least in the latent-variable SEM context, specifically in
*>*AMOS.  This may be what Francisco is getting at.
*>>*To my knowledge, no R packages implement this sort of "FIML", for any
*>*class of models, although there are other available missing data
*>*treatments (EM, MCMC estimation). *

*This is what is correctly referred to as FIML. Your original post claimed
that FIML was available through the ML option which is incorrect, and will
not fix missing values except in the dependent variable. The fact that some
software may claim that it does something that it doesn't will not change
this. What could be said is that FIML is simply ML done correctly in that
it builds the proper model for the data, rather than ignoring the
observations with missing data, so both are maximum likelihood. *
On 27 February 2015 at 17:27, landon hurley <ljrhurley at gmail.com> wrote:

            

  
    
#
Thank you very much to everyone who has replied for your helpful
suggestions.

For clarification about FIML (and in support of what Ken explained), my
professor who does multilevel modeling in SAS tells me that in SAS, "FIML"
refers to a form of maximum likelihood estimation that can accept an
incomplete data set, and does not omit the observations with missing data
as must be done in both "ML" and "REML" in nlme.  FIML in SAS handles
observations in which the data is missing for some variables by just using
those variables for which data is available and integrating over the
missing values.  This is the default method in SAS PROC MIXED for all mixed
effects models (not just for structural equation modeling).  But this
functionality does not appear to be available in R except for structural
equation modeling (i.e. package, lavaan).

Given that, I am now working on a multiple imputation solution for my
problem, using either mice or Amelia, and will post again to the list once
I have a working example.  (Apparently, I was wrong about mice only being
able to impute one variable.)  How many imputations are needed?  Many
sources online indicate that 3-10 is usually enough, and the default in
both mice and Amelia is 5.

Bonnie
On Thu, Feb 26, 2015 at 11:26 PM, Ken Beath <ken.beath at mq.edu.au> wrote:

            
#
On 28 February 2015 at 07:00, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:

            
Others claim 20, and that seems to be more than sufficient for a lot of
problems. It will depend on what proportion of your data is missing, and
how dependent the outcome is on these. As you generally can't have too many
then I would start with say 20 and then try a couple of larger number and
if there is no change then 20 was sufficient.
#
I hate to be so blunt here, but this is just flat out wrong. proc mixed is great and all, but it doesn't do such a thing. Just like lmer() and lme() (with na.action=na.omit), proc mixed will just delete rows with missing data and then use ML or REML estimation on what's left (which is perfectly fine under certain missing data mechanisms). Consequently, fitting the same model with proc mixed and lmer() or lme() to the same data with missing data yields essentially identical results. One can easily confirm this with a few examples.
Indeed, one has to switch to some form of a latent variable model if one wants to use FIML. In R, one should look into 'lavaan' or 'sem' (or 'OpenMX' for the more adventurous). In SAS, one would need to use something like proc calis:

http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/statug_calis_sect103.htm

Again, proc mixed does not use FIML. I am really just repeating what Ken has already stated. Also relevant:

http://stats.stackexchange.com/questions/51006/full-information-maximum-likelihood-for-missing-data-in-r

Best,
Wolfgang