Dear list;

I am using nlme to create a repeated measures (i.e. 2-level) model. There is missing data in several of the predictor variables. What is the best way to handle this situation? The variable with (by far) the most missing data is the best predictor in the model, so I would not want to remove it. I am also trying to avoid omitting the observations with missing data, because that would require omitting almost 40% of the observations and would result in a substantial loss of power.

A member of my dissertation committee who uses SAS recommended that I use full information maximum likelihood estimation (FIML) (described here: http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf), which is the easiest way to handle missing data in SAS. Is there an equivalent procedure in R?

Alternatively, I have tried several approaches to multiple imputation. For example, I used the Amelia package, which appears to handle the clustered structure of the data appropriately, to generate five imputed versions of the data set, and then used lapply to run my model on each. But I am not sure how to combine the resulting five models into one final result. I will need a final result that enables me to report not just the fixed effects of the model, but also the random-effects variance components and, ideally, the distributions across the population of the random intercept and slopes, and the correlations between them.

Many thanks for any suggestions on how to proceed.

Bonnie
Best way to handle missing data?
10 messages · Mitchell Maltenfort, landon hurley, Bonnie Dixon +2 more
Mice might be the package you need
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
____________________________ Ersatzistician and Chutzpahthologist I can answer any question. "I don't know" is an answer. "I don't know yet" is a better answer. "I can write better than anybody who can write faster, and I can write faster than anybody who can write better" AJ Liebling
I actually did try mice also (method "2l.norm"), but it seemed that Amelia was preferable for imputation. Mice seemed to only be able to impute one variable, whereas Amelia can impute as many variables as have missing data, producing 100% complete data sets as output. However, most of the missing data in the data set I am working with is in just one variable, so I could consider using mice, imputing only the variable that has the most missing data, and omitting observations that have missing data in any of the other variables.

But the pooled results from mice only seem to include the fixed effects of the model, so this still leaves me wondering how to report the random effects, which are very important to my research question. When using Amelia to impute, the packages Zelig and ZeligMultilevel can be used to combine the results from each of the models. But again, only the fixed effects seem to be included in the output, so I am not sure how to report on the random effects.

Bonnie
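[Editorial note: the pooling step Bonnie is asking about is plain arithmetic (Rubin's rules), so it can be done by hand on any set of per-imputation fits when a package only pools fixed effects. The sketch below uses made-up numbers for one coefficient from m = 5 imputed-data fits; variance-component point estimates can be averaged the same way, though their interval estimates are a harder problem, as the thread notes.]

```python
import math

# Hypothetical results for ONE fixed-effect coefficient from m = 5 models,
# each fit to a different imputed dataset (numbers are illustrative only):
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]   # per-imputation point estimates
variances = [0.010, 0.012, 0.011, 0.009, 0.013]  # per-imputation squared SEs
m = len(estimates)

# Rubin's rules: the pooled estimate is the mean of the per-imputation estimates.
pooled_est = sum(estimates) / m

# Total variance = mean within-imputation variance
#                + (1 + 1/m) * between-imputation variance.
within = sum(variances) / m
between = sum((e - pooled_est) ** 2 for e in estimates) / (m - 1)
total_var = within + (1 + 1 / m) * between
pooled_se = math.sqrt(total_var)

print(pooled_est, pooled_se)
```

With these illustrative numbers the pooled estimate is 0.50 with total variance 0.01274; the between-imputation term is what penalizes the standard error for the uncertainty introduced by the missing data.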
On 02/26/2015 09:30 PM, Bonnie Dixon wrote:
Dear list; A member of my dissertation committee who uses SAS, recommended that I use full information maximum likelihood estimation (FIML) (described here: http://www.statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf), which is the easiest way to handle missing data in SAS. Is there an equivalent procedure in R?
If you are interested in maximum likelihood methods, you can use either ml or reml, specified with the method flag for the nlme command. However, ml is the default method for estimating parameters in nlme, so you shouldn't need to do anything at all other than specifying the model.

From your email, it seems that you are saying that the number of observations/groups reported is not the number that you are expecting. Is that correct? This is assuming you are content with the multivariate normal assumption and are not trying to analyse discrete outcomes.
I actually did try mice also (method "2l.norm"), but it seemed that Amelia was preferable for imputation. Mice seems to only be able to impute one variable, whereas Amelia can impute as many variables as have missing data producing 100% complete data sets as output.
Mice will impute the entire dataset. Off hand, I believe the syntax would look something like mice(data, m = , method = , maxit = ), where m is the number of independent datasets being imputed (generally you want 25+), maxit is at least 10, and method is a vector of character indications of how you want to impute each of the variables, in the same order that they appear if you use the command names(data). If you specified 2l.norm, it should have attempted to impute all the variables using that method, which may not have worked.

What mice does is impute each marginal variable, using the other variables to predict the true value; within each imputation this is repeated the number of times specified with the maxit flag (random draws using Gibbs sampling), for m imputations in total. Again, nlme uses maximum likelihood by default, so you shouldn't need to change anything, as long as you are content with the MVN and missing-at-random assumptions for your data.

landon

--
Violence is the last refuge of the incompetent.
mice will impute the complete dataset; it just needs to have an imputation method set up for each variable. See the example given in the help for mice.impute.2lonly.norm.

Full information maximum likelihood estimation (FIML) (note for Landon: this is ML taking into account the missing data) is only feasible if you can reformulate everything as a structural equation model and use software that can cope with this. Otherwise working with the integrals is pretty much impossible. If there is something in the model that is nonlinear, it probably isn't an option at all.

One of the great things about multiple imputation is that you can get it running with, say, 20 imputations and then run it overnight with 200 or more; the results probably won't change, but you will know that you have enough imputations. So FIML doesn't have an advantage in that respect.
*Ken Beath* Lecturer Statistics Department MACQUARIE UNIVERSITY NSW 2109, Australia Phone: +61 (0)2 9850 8516 Building E4A, room 526 http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/ CRICOS Provider No 00002J
On 02/27/2015 01:02 AM, Ken Beath wrote:
> Full information maximum likelihood estimation (FIML) (Note for Landon, this is ML taking into account the missing data) is only feasible if you can reformulate everything as a structural equation model and use software that can cope with this.
I'm not sure that's needed as a distinction. This quote from the r-help mailing list [0] addresses it:
I'm not sure you are correct on this. Other texts on multilevel models (e.g., Raudenbush and Bryk, Kreft and Deeuw, and Singer & Willett) all use FiML as a synonym for ML. In fact, Kreft and Deleeuw go as far to even state they are the same thing (see page 131). When you run a model in HLM selecting "Full Maximum Likelihood" and method="ML" in lme, the results, including all fixed effects, variance components, empirical bayes residuals, degrees of freedom are exactly the same. So, I think Doug [Bates] is correct in that ML == FiML. Harold
So maybe it's a semantic difference. However, with respect to the handling of the integral: if it's problematic, that should show up as a non-convergence problem, or as different results when he reruns the model, in terms of diagnostics.

[0] https://stat.ethz.ch/pipermail/r-help/2004-August/056723.html
From the same posting
From: Chris Lawrence <chris at lordsutch.com>
<snip>
> I have seen FIML used to refer to a type of ML estimation where a
> missing data treatment is included in the estimation procedure
> (parameter estimates are derived from incomplete cases for only the
> variables present in the case, rather than simply discarding the
> cases), at least in the latent-variable SEM context, specifically in
> AMOS. This may be what Francisco is getting at.
>
> To my knowledge, no R packages implement this sort of "FIML", for any
> class of models, although there are other available missing data
> treatments (EM, MCMC estimation).

This is what is correctly referred to as FIML. Your original post claimed that FIML was available through the ML option, which is incorrect, and will not fix missing values except in the dependent variable. The fact that some software may claim that it does something that it doesn't will not change this. What could be said is that FIML is simply ML done correctly, in that it builds the proper model for the data rather than ignoring the observations with missing data, so both are maximum likelihood.
Thank you very much to everyone who has replied for your helpful suggestions.

For clarification about FIML (and in support of what Ken explained): my professor who does multilevel modeling in SAS tells me that in SAS, "FIML" refers to a form of maximum likelihood estimation that can accept an incomplete data set, and does not omit the observations with missing data as must be done with both "ML" and "REML" in nlme. FIML in SAS handles observations in which the data is missing for some variables by using just those variables for which data is available and integrating over the missing values. This is the default method in SAS PROC MIXED for all mixed effects models (not just for structural equation modeling). But this functionality does not appear to be available in R except for structural equation modeling (i.e. the lavaan package).

Given that, I am now working on a multiple imputation solution for my problem, using either mice or Amelia, and will post again to the list once I have a working example. (Apparently, I was wrong about mice only being able to impute one variable.)

How many imputations are needed? Many sources online indicate that 3-10 is usually enough, and the default in both mice and Amelia is 5.

Bonnie
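[Editorial note: the usual justification for "a handful of imputations is enough" is Rubin's relative-efficiency result: an estimator based on m imputations has efficiency 1 / (1 + γ/m) relative to m = ∞, where γ is the fraction of missing information. A quick language-neutral check, taking γ = 0.4 as an illustrative assumption roughly matching the ~40% missingness in this thread (γ is typically smaller than the raw missing-data rate):]

```python
# Relative efficiency of m imputations vs. infinitely many (Rubin, 1987):
#   eff(m) = 1 / (1 + gamma / m),  gamma = fraction of missing information.
def relative_efficiency(gamma: float, m: int) -> float:
    return 1.0 / (1.0 + gamma / m)

gamma = 0.4  # assumed fraction of missing information (illustrative only)
for m in (5, 10, 20, 100):
    print(m, round(relative_efficiency(gamma, m), 3))
```

Even m = 5 is above 92% efficiency for point estimates under this assumption, which is why the old 3-10 rule persists; the reason Ken suggests far more imputations is that stable standard errors, p-values, and variance-component estimates converge more slowly than the point estimates themselves.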
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAEBCgAGBQJU8A5AAAoJEDeph/0fVJWsbNUP/invP0QBC1qS0sWfKrnRVM09 kV1fv4Y8rVflFnS+znsbAPDJOK+5YnvITmfoVLMdwTAWaUEyugKZVGDydY+fTDfg GxokxDpNAdGlfDBg+asw49VOFoTFtBKai0PWKyw4zHrAHYS9rzTqeO2CVq1Qlb8G F7je9naYr+iwcEkIWQZ2JloBH8OPw80UueWqNjQ0totVRN8ehYgsu2+iyyudTQnH Sl7LWkg6QnDYYVKrlV9ygd6z9yOymU9f5w52px1cUIY0mBoT12fYturEfyi/aIxF +3nBjRCE14C2c9y6mW2Lab9AYpR8bbzsmTK6y7PXid6/VxcqkZlE6Qsj4bD4zvK3 lkIdFj8BR2LdzJNI1EdM8LREA82VPrkS5LFf/4ige0pSo6X3aVoInC2ohLKGSdr5 r66Nh3tLu1a6kPtPBNw7YAxzkzRd2CKy9OTvOpz5wRqlXNvzOoq2Is7Hpoeva0yB 3hvAAgmJUtq8ZbTEXLQiDl2w/qeO+8o5KRfm/2uutN8z29S768me/6bfnvLELw9w y2R4vwOGdpp+3XBAfs8sF5bMGVvTEzZj/ILph5D7OFRJi/pfCbntnf2mAFrllvlt KUh+Okd0bO5dC2gfLuu42J3jQnCTMez/ghrEVlXkRX9XMnMz3JB7r4pdgmUqXHYu w9eXfCoXza9efwhgHF1q =LMV6 -----END PGP SIGNATURE----- _______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models -- *Ken Beath* Lecturer Statistics Department MACQUARIE UNIVERSITY NSW 2109, Australia Phone: +61 (0)2 9850 8516 Building E4A, room 526 http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/ CRICOS Provider No 00002J This message is intended for the addressee named and m...{{dropped:10}}
On 28 February 2015 at 07:00, Bonnie Dixon <bmdixon at ucdavis.edu> wrote:
Given that, I am now working on a multiple imputation solution for my problem, using either mice or Amelia, and will post again to the list once I have a working example. (Apparently, I was wrong about mice only being able to impute one variable.) How many imputations are needed? Many sources online indicate that 3-10 is usually enough, and the default in both mice and Amelia is 5.
Others claim 20, and that seems to be more than sufficient for a lot of problems. It will depend on what proportion of your data is missing, and how dependent the outcome is on these. As you generally can't have too many then I would start with say 20 and then try a couple of larger number and if there is no change then 20 was sufficient.
For clarification about FIML (and in support of what Ken explained), my professor who does multilevel modeling in SAS tells me that in SAS, "FIML" refers to a form of maximum likelihood estimation that can accept an incomplete data set, and does not omit the observations with missing data as must be done in both "ML" and "REML" in nlme. FIML in SAS handles observations in which the data is missing for some variables by just using those variables for which data is available and integrating over the missing values. This is the default method in SAS PROC MIXED for all mixed effects models (not just for structural equation modeling).
I hate to be so blunt here, but this is just flat out wrong. proc mixed is great and all, but it doesn't do such a thing. Just like lmer() and lme() (with na.action=na.omit), proc mixed will just delete rows with missing data and then use ML or REML estimation on what's left (which is perfectly fine under certain missing data mechanisms). Consequently, fitting the same model with proc mixed and lmer() or lme() to the same data with missing data yields essentially identical results. One can easily confirm this with a few examples.
But this functionality does not appear to be available in R except for structural equation modeling (i.e. package, lavaan).
Indeed, one has to switch to some form of a latent variable model if one wants to use FIML. In R, one should look into 'lavaan' or 'sem' (or 'OpenMX' for the more adventurous). In SAS, one would need to use something like proc calis: http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/statug_calis_sect103.htm Again, proc mixed does not use FIML. I am really just repeating what Ken has already stated. Also relevant: http://stats.stackexchange.com/questions/51006/full-information-maximum-likelihood-for-missing-data-in-r Best, Wolfgang