
Best way to handle missing data?

4 messages · Bonnie Dixon, David Duffy, Joseph Bulbulia, Wolfgang Viechtbauer

#
Thank you for this clarification.  I can see from studying the article
linked below more closely that it confirms what you have said.
http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf

The distinction seems to be between missing data in the dependent variable
(which SAS PROC MIXED handles automatically) versus missing data in a
predictor variable (which would require switching to a structural equation
modeling program, such as SAS PROC CALIS, in order to handle it automatically
via FIML).  Here is a quote from the conclusion of the article that explains
this:

"When estimating mixed models for repeated measurements, PROC MIXED and
PROC GLIMMIX automatically handle missing data by maximum likelihood, as
long as there are no missing data on predictor variables. When data are
missing on both predictor and dependent variables, PROC CALIS can do
maximum likelihood for a large class of linear models..."

This sounds approximately equivalent to the functionality available in R.
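
If I understand correctly, the R analogue would be roughly the following (just
a sketch; dat, y, x, time, and id are placeholder names): lme4::lmer, with the
usual na.action = na.omit default, drops any occasion with a missing value, so
missing outcomes are still handled by maximum likelihood on the remaining
occasions, whereas a missing predictor loses the whole occasion.

library(lme4)

# occasions with a missing outcome are dropped and the rest analysed by ML,
# which is valid under MAR -- analogous to PROC MIXED / PROC GLIMMIX
fit <- lmer(y ~ x + time + (1 | id), data = dat, REML = FALSE)

# an occasion with a missing predictor is dropped entirely; keeping it would
# require FIML in an SEM package (e.g. lavaan with missing = "fiml") or
# multiple imputation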

I don't think the model I am working on is a good candidate for structural
equation modeling because the data set is very unbalanced (i.e., there are
very different numbers of observations for different people, taken at
different times), the main relationship of interest involves a time-varying
predictor, and one of the variables with missing data is not continuous (it
is a binary, categorical variable).  So, I will stick with the multiple
imputation approach for handling the missing data.
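
For concreteness, roughly what I have in mind with the mice package (only a
sketch; dat, y, x, z, time, and id are placeholders for my actual variables,
and pooling lmer fits with pool() appears to also need broom.mixed installed):

library(mice)
library(lme4)

# with the default methods, mice imputes a binary factor (here z) by logistic
# regression and numeric variables by predictive mean matching; the default
# imputation models ignore the multilevel structure, which may need more care
imp <- mice(dat, m = 10, seed = 1)

# fit the mixed model to each imputed dataset and pool by Rubin's rules
fits <- with(imp, lmer(y ~ x * time + z + (1 | id), REML = FALSE))
summary(pool(fits))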

Bonnie


On Fri, Feb 27, 2015 at 4:22 PM, Viechtbauer Wolfgang (STAT) <
wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:

#
On Mon, 2 Mar 2015, Bonnie Dixon wrote:

As Wolfgang mentioned, OpenMx can fit a FIML analysis to irregular data. 
If you were, for example, interested in a profile likelihood around a 
variance component, that might be the way to go.  It seems to me that 
multiple imputation might not always respect complicated 
clustering/correlation, depending on the actual method. A quick search 
found some cautionary tales in:

http://www.bmj.com/content/338/bmj.b2393.extract

Just another 2c, David.


| David Duffy (MBBS PhD)
| email: David.Duffy at qimrberghofer.edu.au  ph: INT+61+7+3362-0217 fax: -0101
| Genetic Epidemiology, QIMR Berghofer Institute of Medical Research
| 300 Herston Rd, Brisbane, Queensland 4006, Australia  GPG 4D0B994A
#
RELATED QUESTION
I have a related and probably naive question, but raising it might be helpful to Bonnie and to others (myself included) who are struggling with multiple imputation in a mixed-effects modeling setting.

FIRST, MY DISCOMFORT
The question arises from (1) my discomfort with averaging across multiply imputed datasets, which seems to lose the uncertainty from the data-generating imputation process, and (2) my need to use a wider class of models than is made available by Zelig, such as MCMCglmm.

NOTE
I realise that MCMCglmm can handle missing values (under MAR) in outcome variables, but where many columns have missing values, the resulting multivariate-outcome model will often become overly complex.

THE QUESTION
To avoid averaging, if multiple datasets were generated (assume sensibly) through a multiple imputation algorithm (say, using the Amelia package), would it make any sense to combine the datasets (e.g. using rbind) with an indicator for each imputed dataset, and then to model that indicator as a random effect in, say, MCMCglmm?
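
To make that concrete, roughly the sort of thing I have in mind (only a
sketch; dat and the columns y, x, and id are placeholders):

library(Amelia)
library(MCMCglmm)

a.out <- amelia(dat, m = 5, idvars = "id")    # 5 imputed datasets

# stack the imputed datasets, flagging each copy with an 'imp' indicator
stacked <- do.call(rbind, lapply(seq_along(a.out$imputations), function(i)
  transform(a.out$imputations[[i]], imp = factor(i))))

# then treat the imputation indicator (and the individual, as a factor)
# as random effects
m1 <- MCMCglmm(y ~ x, random = ~ id + imp, data = stacked)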

REASONING 
If the observations from the datasets were conceived as measurements on individuals (with the individual also included as a random effect), then conceptually it seems you would be adjusting your expectation for the variation of multiple observations within individuals across the multiply imputed datasets. Where there is no imputation, the observed values remain constant, and part of me thinks this constancy of observations within individuals shouldn't affect the estimates... I think?

SNAG
On the other hand, just combining datasets with an indicator for each dataset would artificially (and often dramatically) increase the number of observations, which might not be handled adequately by the G/R structures.


APOLOGY
I apologise if this question makes little sense, or if the answer is just plain obvious. I'd intended to ask a statistician at work and to simulate some data with him, but the topic came up here, and I figured others who had the same (potentially naive) thought might benefit if the experts have a quick answer, even if the answer is "you are muddled."

Cheers, 

Joseph

#
With MI, you do indeed average parameter estimates across the imputed datasets. And the way the SE for such an average is computed takes into consideration not only the variance of the estimate conditional on a particular dataset but also the variability across datasets. That's in fact the entire point of doing the imputation multiple times.

See, for example: http://sites.stat.psu.edu/~jls/mifaq.html#howto

One can apply that principle to any parameter estimate, even if this computation is not automated for particular models via a package.
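
For a scalar parameter, the pooling is only a few lines in R, for example (the
function name here is just for illustration; est is the vector of per-dataset
estimates and se the corresponding standard errors):

# Rubin's rules for m imputed datasets
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)               # pooled point estimate (the average)
  ubar <- mean(se^2)              # average within-imputation variance
  b    <- var(est)                # between-imputation variance
  tvar <- ubar + (1 + 1/m) * b    # total variance
  c(estimate = qbar, se = sqrt(tvar))
}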

Best,
Wolfgang