
Best way to handle missing data?

4 messages · Bonnie Dixon, David Duffy, Joseph Bulbulia, Wolfgang Viechtbauer

#
Thank you for this clarification.  I can see from studying the article
linked below more closely that it confirms what you have said.
http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf

The distinction seems to be between missing data in the dependent variable
(which SAS PROC MIXED handles automatically) versus missing data in a
predictor variable (which would require switching to a structural equation
modeling program, such as SAS PROC CALIS, in order to handle it automatically
via FIML).  Here is a quote from the conclusion of the article that explains
this:

"When estimating mixed models for repeated measurements, PROC MIXED and
PROC GLIMMIX automatically handle missing data by maximum likelihood, as
long as there are no missing data on predictor variables. When data are
missing on both predictor and dependent variables, PROC CALIS can do
maximum likelihood for a large class of linear models..."

This sounds approximately equivalent to the functionality available in R.
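
If I understand correctly, the R analogue would be roughly the following (just
a sketch; dat, y, x, time, and id are placeholder names): lme4::lmer, with the
usual na.action = na.omit default, drops any occasion with a missing value, so
missing outcomes are still handled by maximum likelihood on the remaining
occasions, whereas a missing predictor loses the whole occasion.

library(lme4)

# occasions with a missing outcome are dropped and the rest analysed by ML,
# which is valid under MAR -- analogous to PROC MIXED / PROC GLIMMIX
fit <- lmer(y ~ x + time + (1 | id), data = dat, REML = FALSE)

# an occasion with a missing predictor is dropped entirely; keeping it would
# require FIML in an SEM package (e.g. lavaan with missing = "fiml") or
# multiple imputation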

I don't think the model I am working on is a good candidate for structural
equation modeling because the data set is very unbalanced (i.e., there are
very different numbers of observations for different people, taken at
different times), the main relationship of interest involves a time-varying
predictor, and one of the variables with missing data is not continuous (it
is a binary, categorical variable).  So, I will stick with the multiple
imputation approach for handling the missing data.
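
For concreteness, roughly what I have in mind with the mice package (only a
sketch; dat, y, x, z, time, and id are placeholders for my actual variables,
and pooling lmer fits with pool() appears to also need broom.mixed installed):

library(mice)
library(lme4)

# with the default methods, mice imputes a binary factor (here z) by logistic
# regression and numeric variables by predictive mean matching; the default
# imputation models ignore the multilevel structure, which may need more care
imp <- mice(dat, m = 10, seed = 1)

# fit the mixed model to each imputed dataset and pool by Rubin's rules
fits <- with(imp, lmer(y ~ x * time + z + (1 | id), REML = FALSE))
summary(pool(fits))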

Bonnie


On Fri, Feb 27, 2015 at 4:22 PM, Viechtbauer Wolfgang (STAT) <
wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:

#
On Mon, 2 Mar 2015, Bonnie Dixon wrote:

As Wolfgang mentioned, OpenMx can fit a FIML analysis to irregular data. 
If you were, for example, interested in a profile likelihood around a 
variance component, that might be the way to go.  It seems to me that 
multiple imputation might not always respect complicated 
clustering/correlation, depending on the actual method. A quick search 
found some cautionary tales in:

http://www.bmj.com/content/338/bmj.b2393.extract

Just another 2c, David.


| David Duffy (MBBS PhD)
| email: David.Duffy at qimrberghofer.edu.au  ph: INT+61+7+3362-0217 fax: -0101
| Genetic Epidemiology, QIMR Berghofer Institute of Medical Research
| 300 Herston Rd, Brisbane, Queensland 4006, Australia  GPG 4D0B994A
#
RELATED QUESTION
I have a related and probably naive question, but raising it might be helpful to Bonnie and to others (myself included) who are struggling with multiple imputation in a mixed-effects modeling setting.

FIRST, MY DISCOMFORT
The question arises from (1) my discomfort with averaging across multiply imputed datasets, which seems to lose the uncertainty from the data-generating imputation process, and (2) my need to use a wider class of models than is made available by Zelig, such as MCMCglmm.

NOTE
I realise that MCMCglmm can handle missing values (under MAR) in outcome variables, but where many columns have missing values, the resulting multivariate-outcome model will often become overly complex.

THE QUESTION
To avoid averaging, if multiple datasets were generated (assume sensibly) through a multiple imputation algorithm (say, using the Amelia package), would it make any sense to combine the datasets (e.g. using rbind) with an indicator for each imputed dataset, and then to model that indicator as a random effect in, say, MCMCglmm?
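
To make that concrete, roughly the sort of thing I have in mind (only a
sketch; dat and the columns y, x, and id are placeholders):

library(Amelia)
library(MCMCglmm)

a.out <- amelia(dat, m = 5, idvars = "id")    # 5 imputed datasets

# stack the imputed datasets, flagging each copy with an 'imp' indicator
stacked <- do.call(rbind, lapply(seq_along(a.out$imputations), function(i)
  transform(a.out$imputations[[i]], imp = factor(i))))

# then treat the imputation indicator (and the individual, as a factor)
# as random effects
m1 <- MCMCglmm(y ~ x, random = ~ id + imp, data = stacked)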

REASONING 
If the observations from the datasets were conceived as measurements on individuals (with the individual also included as a random effect), then conceptually it seems you would be adjusting your expectation for the variation of multiple observations within individuals across the multiply imputed datasets. Where there is no imputation, the observed values remain constant, and part of me thinks this constancy of observations within individuals shouldn't affect the estimates... I think?

SNAG
On the other hand, just combining datasets with an indicator for each dataset would artificially (and often dramatically) increase the number of observations, which might not be handled adequately by the G/R structures.


APOLOGY
I apologise if this question makes little sense, or if the answer is just plain obvious. I'd intended to ask a statistician at work and to simulate some data with him, but the topic came up here, and I figured others who had the same (potentially naive) thought might benefit if the experts have a quick answer, even if the answer is "you are muddled."

Cheers, 

Joseph

#
With MI, you do indeed average parameter estimates across the imputed datasets. And the way the SE for such an average is computed takes into consideration not only the variance of the estimate conditional on a particular dataset but also the variability across datasets. That's in fact the entire point of doing the imputation multiple times.

See, for example: http://sites.stat.psu.edu/~jls/mifaq.html#howto

One can apply that principle to any parameter estimate, even if this computation is not automated for particular models via a package.
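
For a scalar parameter, the pooling is only a few lines in R, for example (the
function name here is just for illustration; est is the vector of per-dataset
estimates and se the corresponding standard errors):

# Rubin's rules for m imputed datasets
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)               # pooled point estimate (the average)
  ubar <- mean(se^2)              # average within-imputation variance
  b    <- var(est)                # between-imputation variance
  tvar <- ubar + (1 + 1/m) * b    # total variance
  c(estimate = qbar, se = sqrt(tvar))
}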

Best,
Wolfgang