Dealing with NAs in LMER with longitudinal data (Re Crime and Education data)

3 messages · Ades, James, David Duffy, Dimitris Rizopoulos

#
I've often heard that mixed-effect lmer/glmer models "handle" or "deal" with NA values well, and I've become more curious about what this actually means, if it is indeed true. What I've observed working with mixed-effect models is that na.omit will delete the entire row of observations, and depending on the number of NAs, the AIC might decrease dramatically and deceptively, simply because the sample is smaller.

I know that one can also use "na.pass" (maybe this is what I've heard in the past with regard to lmer handling NAs well?), though I've often found that this doesn't always work, throwing back the error that ```Error in qr.default(X, tol = tol, LAPACK = FALSE) : NA/NaN/Inf in foreign function call (arg 1)```. When it does work, I'm not sure how it works. I looked through the lme4 manual and the "Fitting Linear Mixed-Effects Models" article, but I couldn't find anything.
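To make the difference between the `na.action` choices concrete, here is a minimal sketch. It assumes the `sleepstudy` data that ships with lme4 and an artificial covariate `x` with 40 values set to NA (both are illustrative stand-ins, not the crime/ed data). It compares how many rows each option keeps, and wraps the `na.pass` call in `tryCatch` because it may stop with an error like the one quoted above; the exact message can depend on the lme4 version.

```r
library(lme4)

set.seed(1)
d <- sleepstudy                       # 180 rows, 18 subjects
d$x <- rnorm(nrow(d))                 # hypothetical covariate, not in the real data
d$x[sample(nrow(d), 40)] <- NA        # set 40 distinct values to NA

# na.omit (the usual default): rows with an NA in any model variable are dropped
m_omit <- lmer(Reaction ~ Days + x + (1 | Subject), data = d,
               na.action = na.omit)

# na.exclude: fitted on the same rows, but residuals are padded with NA
m_excl <- lmer(Reaction ~ Days + x + (1 | Subject), data = d,
               na.action = na.exclude)

nobs(m_omit)                # rows actually used in the fit
length(residuals(m_omit))   # one residual per used row
length(residuals(m_excl))   # one per original row, NA where a row was dropped

# na.pass hands the NAs straight to the fitting code; capture whatever happens
pass_result <- tryCatch(
  lmer(Reaction ~ Days + x + (1 | Subject), data = d, na.action = na.pass),
  error = function(e) conditionMessage(e)
)
pass_result
```

In this sketch neither `na.omit` nor `na.exclude` recovers any information from the incomplete rows; they differ only in how the output lines up with the original data.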

I'd assume that imputation is better practice for handling NAs. Though, specifically referencing my crime/ed analysis (I've posted the data here: https://drive.google.com/open?id=1wRwLqCKNfpz5aHtyy5KfY07_RFqWsWv9), this is a bit more difficult, and something I have yet to do. I've been reading about it here: https://stefvanbuuren.name/fimd/sec-rastering.html.

In addition, there are instances where data is only offered every five years, or, as is the case with a presidential election, every four years. My "bandaid" approach for this kind of data pitfall is to stagger the four years, so that the election data counts for the two years preceding and the two years following the election (this is an assumption, but it seems preferable to NAs for three out of four years).
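The staggering described above can be sketched as a nearest-election fill. The data frames `panel` and `elections` and the column `dem_share` are hypothetical names and values made up for illustration. One detail the sketch makes visible: a window of two years either side of each election overlaps at the midpoint between elections (e.g. 2010 sits between 2008 and 2012), so some tie rule is needed; here ties go to the earlier election.

```r
# Hypothetical yearly panel for one place, and election results every four years
panel <- data.frame(year = 2006:2018)
elections <- data.frame(year = c(2008, 2012, 2016),
                        dem_share = c(0.53, 0.51, 0.48))  # invented values

# For each panel year, pick the nearest election year.
# which.min() returns the first minimum, so ties go to the earlier election.
nearest <- sapply(panel$year, function(y) {
  elections$year[which.min(abs(elections$year - y))]
})

panel$election_year <- nearest
panel$dem_share <- elections$dem_share[match(nearest, elections$year)]
panel
```

The filled variable is a step function of year by construction, which is worth keeping in mind when looking at its correlation with year.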

Still, it seems that weirdness might be accompanying this method. Looking at educational attainment data (averaged over a five-year period) in the dataset, there is an unseemly high correlation between year and the proportion of people in a place at each level of educational attainment (some high school, HS diploma, some college, bachelor's, MA, etc.); these individual variables have anywhere from a -.5 to a .6 correlation with year.

Code for looking at correlations:
```
library(dplyr); library(corrplot)
# Column 3 plus columns 8 to 31, complete rows only
cor.total.years.city <- total.years.city.select %>% select(3, 8:31) %>% na.omit()
cor1 <- cor(cor.total.years.city)
corrplot.mixed(cor1, lower.col = "black", number.cex = .7)
```

Perhaps I should put these variables into long format, but I've read that sometimes this exacerbates multicollinearity. (And this wouldn't solve the correlation strangeness.)

To summarize: 
1. If lmer does handle NAs well, how exactly is it doing that? If "na.pass" fails, then is it handling NAs the same way as any other program?
2. Is imputation (done correctly) better than allowing mixed-effect functions to handle NAs?
3. Any specific resources on imputing longitudinal data?
4. For data offered every four years, is my method of staggering (and filling) this data sufficient? Is there another way I should be thinking about this in lme4? Is this the source of funky correlations between education attainment and year?
5. Should I be using long format here for variables like race (black, white, asian, latino) and education attainment (some high school, HS diploma, some college, bachelor's, MA/grad school)?

Thanks much!

James
1 day later
#
My limited understanding is that na.pass usually affects just the copy of the data in the returned object. It won't get around the fact that if you are conditioning on fixed effects, only complete observations can be used. So if you want your AICs to be comparable, you need to have a single dataset that is complete for all the variables you are interested in.
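The point about comparable AICs can be sketched as follows. The example assumes lme4's `sleepstudy` data and an invented covariate `x` with missing values (neither is from the crime/ed dataset). The two models are first fitted the naive way, where they silently use different numbers of rows, and then the smaller model is refitted on the rows that are complete for every variable of interest.

```r
library(lme4)

set.seed(1)
d <- sleepstudy
d$x <- rnorm(nrow(d))              # hypothetical covariate
d$x[sample(nrow(d), 40)] <- NA     # 40 missing values

# ML fits (REML = FALSE) so that models with different fixed effects can be compared
m_small <- lmer(Reaction ~ Days + (1 | Subject), data = d, REML = FALSE)
m_big   <- lmer(Reaction ~ Days + x + (1 | Subject), data = d, REML = FALSE)

# The two fits use different numbers of rows, so their AICs are not comparable
c(small = nobs(m_small), big = nobs(m_big))

# Build one dataset that is complete for all variables of interest
vars <- c("Reaction", "Days", "x", "Subject")
d_cc <- d[complete.cases(d[, vars]), ]

# Refit the smaller model on the common rows, then compare
m_small_cc <- update(m_small, data = d_cc)
AIC(m_small_cc, m_big)
```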
If you have non-ignorable missing data, then these must be included as response variables, so the mixed model can combine the correct likelihoods for each pattern of missingness. I have more experience with a straightforward multivariate formulation for this, so I don't know how or if you can mimic this in the lmer framework. Quite aside from whether you want to specify directional paths between such variables, imputation is the cheap and cheerful answer.
I'd have thought so, unless you already have a handle on the causes of any autocorrelation.

Hopefully someone more in your area will respond, but in animal breeding genetics, there are mixed models of similar huge longitudinal datasets (people I know in human genetics were great fans of the Journal of Dairy Science ;), and of ASReml).

Cheers, David Duffy.
#
We should distinguish between missing data in the outcome and missing data in the covariates.

For missing data in the outcome, mixed effects models provide unbiased estimates and valid inferences under the missing completely at random and missing at random mechanisms. No (multiple) imputation of the outcome is required in this case. The only requirement is that the model is adequately and flexibly specified with regard to both the fixed- and random-effects structures. For the fixed-effects part in particular, you need to include any covariates that potentially relate to the reasons why you have missing data. Finally, if the missing data mechanism is missing not at random, then the mixed model alone is not enough and you will need to jointly model the outcome and the dropout process.
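A small simulation can illustrate the missing-at-random case for the outcome. This is only a sketch under invented settings: the sample size, the true intercept of 10 and slope of 1, and the dropout rule are all assumptions chosen for illustration. Dropout depends on an observed outcome value (a subject leaves after the first visit where `y` falls below 10), so the mechanism is missing at random given the observed data.

```r
library(lme4)

set.seed(42)
n <- 200
d <- expand.grid(time = 0:4, id = 1:n)    # time varies fastest within id
b0 <- rnorm(n, mean = 0, sd = 2)          # subject-specific random intercepts
d$y <- 10 + 1 * d$time + b0[d$id] + rnorm(nrow(d), sd = 1)

# MAR dropout: after the first observed y below 10, all later visits are missing
d$observed <- TRUE
for (i in seq_len(n)) {
  rows <- which(d$id == i)
  low <- which(d$y[rows] < 10)
  if (length(low) > 0 && low[1] < length(rows)) {
    d$observed[rows[(low[1] + 1):length(rows)]] <- FALSE
  }
}
d_obs <- d[d$observed, ]

# Mixed model fitted to all available rows, with no imputation of y
fit <- lmer(y ~ time + (1 | id), data = d_obs)
fixef(fit)

# For contrast: raw means of the observed outcomes at each visit
tapply(d_obs$y, d_obs$time, mean)
```

Under these assumptions one would expect the fixed-effect estimates to stay near the simulated values, while the raw per-visit means of the observed data drift upward at later visits because low trajectories have dropped out. Running it with different seeds is a quick way to check that expectation.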

For missing data in the covariates, you will need to use multiple imputation. It is important that the whole outcome is included in the imputation step. This is more challenging, for example, for longitudinal outcomes that are not measured at the same time points for all subjects. There are approaches and R packages to handle these situations.
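As one possible illustration of the workflow (impute, fit on each completed dataset, pool), here is a sketch using the mice package, chosen here because the original post links to van Buuren's book; it is not necessarily the package meant above. It assumes lme4's `sleepstudy` data and an invented covariate `x`, and it assumes broom.mixed is installed so that `pool()` can extract the lmer estimates. Note the limitation: this is a flat, single-level imputation on long-format data that only uses the outcome from the same row and treats `Subject` as an ordinary categorical predictor, so it falls short of including the whole outcome and of a properly multilevel imputation model.

```r
library(mice)
library(lme4)
library(broom.mixed)   # assumed installed; lets pool() tidy lmer fits

set.seed(1)
d <- sleepstudy
d$x <- 0.05 * d$Reaction + rnorm(nrow(d))   # hypothetical covariate related to the outcome
d$x[sample(nrow(d), 40)] <- NA              # 40 missing covariate values

# Impute x five times with predictive mean matching.
# By default all other columns (Reaction, Days, Subject) serve as predictors.
imp <- mice(d, m = 5, method = "pmm", printFlag = FALSE, seed = 1)

# Fit the same mixed model on each completed dataset
fits <- with(imp, lmer(Reaction ~ Days + x + (1 | Subject)))

# Combine the five sets of estimates with Rubin's rules
summary(pool(fits))
```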

Best,
Dimitris


From: David Duffy <David.Duffy at qimrberghofer.edu.au>
Date: Tuesday, 17 Sep 2019, 06:03
To: Ades, James <jades at ucsd.edu>, r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Dealing with NAs in LMER with longitudinal data (Re Crime and Education data)
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models