lme vs. lmer
On Wed, Oct 7, 2009 at 6:03 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
Hi Paul, Thanks for your response - it's much appreciated! Yes, I confess I'm new to R and regression modeling, so it has been a challenge trying to decipher the 'R- and modeling-speak' of the experts! I do understand what you've said below (which is great because it confirms that I'm on the right track...), however, my question relates more to how one decides which terms to drop. As I understand it, one can use the p-values to determine which terms to drop first (i.e. drop the term with the highest p-value, then re-run the model and compare the two results with anova; if there's a significant difference, then accept the new model. Correct?)
No, that is backwards. If the anova says the two models are different, keep the full, unrestricted model. The significance of the difference indicates you lost "explanatory power" when you removed a variable or recoded to remove a parameter.

This is done since one wants a model that explains the data well, and non-significant terms don't contribute to explaining the data, right? The challenge with using glmer with quasipoisson is that it does not give p-values for each term or interaction - so how does one decide which term to drop...if any? From what I've gleaned from previous advice from others, one can drop any term and compare the result with the full model by anova to get the p-value for that term. So should one do that for all the terms first (one at a time), then drop the term with the highest p-value, and then repeat the whole process again?
It's not a problem that "the program does not give you a p-value." The problem is that the statistical theory is tenuous enough that nobody is sure what the p-value ought to be. The quasi model is based on the idea that nobody knows the sampling distribution of the random component very well, so the standard errors are, well, sorta not-really-standard. I think you'd probably want to calculate some kind of robust standard error, but I've not done it with a quasi-poisson model. Instead of a quasi-poisson, you could look around for a mixed model program that has a negative binomial option. For the negative binomial, we have a more exact model of randomness and the standard errors are better understood. lmer does not have nb, but I'm pretty sure I've seen one somewhere.

If you really believe you want a p-value, you can calculate one yourself. From the "summary" output, inspect and you'll see there are columns of b's and standard errors. Divide away for yourself.

The problem of deciding which variable to drop is a hard one; it is the same in ordinary regression. I'm trained in a tradition that says you should try to choose variables by theory, and not commit the sin of dropping variables just because they are "not significant". If there is any multicollinearity, the dropping process may lead to mistaken conclusions. This is the flaw in so-called "stepwise" regression. You are a "bonehead" if you let the model tell you which parameters to include. Models will lie to you. Model pruning of that sort--the search for "significant" estimates--produces bad t-tests and a lot of silly articles getting published. I've seen economists and political scientists crop up with survey articles saying that just about everything we publish is misleading/wrong because of the model pruning approach. I've wondered if we could not work out a "regression tree" or "forest" framework to choose which variables are needed in your glm.
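The divide-it-yourself calculation Paul describes is just a Wald z-test. A minimal sketch of the arithmetic, with a made-up coefficient and standard error standing in for the numbers you would read off your own summary output:

```r
# Wald z-test by hand: z = estimate / std.error, two-sided normal p-value.
# In practice you would read b and se from the coefficient table printed by
# summary(fit); the values below are invented for illustration.
b  <- 0.42   # hypothetical coefficient estimate
se <- 0.15   # hypothetical standard error
z  <- b / se
p  <- 2 * pnorm(-abs(z))
round(c(z = z, p = p), 4)
```

Keep in mind Paul's caveat: under a quasi-poisson fit the standard errors themselves are the shaky part, so a p-value computed this way inherits that shakiness.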
I read a lot about it, but concluded it did not exactly fit my need. If you are looking for some rigorous justification to include/exclude variables, I think you have to look in that more exotic direction. I saw a beautiful presentation about the LASSO, which selects and estimates and accounts for shrinkage as well. There's an article by Ed Leamer from AER with a title like "Let's take the con out of econometrics." It deals with the variable selection problem. I *think* his suggested approach would be the one we call Bayesian Model Averaging today. If you fit a lot of models, dropping variables in and out, then the final result should somehow summarize the variety of estimates you observed. Sorry, this is preaching in the wrong context. You don't really have an r-sig-mixed problem here; you have a more general (somewhat religious) question about regression modeling. pj
Any further advice would be greatly appreciated. Thanks, Raldo PS. I had attached the data in a text file in the previous e-mail; is that not acceptable for R Help? (I've attached the file again and it is also available here). On Tue, Oct 6, 2009 at 10:19 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:
On Wed, Sep 30, 2009 at 2:40 PM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
Chris, Thanks for that - I should probably have mentioned that I'm using family=quasipoisson in glmer since my data has a Poisson distribution as well as being overdispersed. I'm unsure how one decides which term to drop without being informed by p-values, and so don't quite understand how the "Likelihood ratio test using anova()", or the AIC or BIC model comparison, will work in this case (I thought one's supposed to remove the term with the highest p-value from the model, and compare it with the model with the term included to see if there's a difference, not so?).
Dear Raldo: Finally there is a question that I can help with! It appears to me you don't have much experience with regression modeling in R, and the other people who answer you are talking a bit "over your head".

In your case, call this fit the "full" or "unrestricted" model:

ex4o_r2 <- glmer(Counts ~ N + G + Year + N:Year + G:Year + N:G:Year + (Year|Site), data=ex4o, family=quasipoisson)

(I'm not commenting on the specification). Suppose you wonder "Should I leave out the multiplicative effect between N and Year?" Then you fit the model that excludes that one element:

ex4o_new <- glmer(Counts ~ N + G + Year + G:Year + N:G:Year + (Year|Site), data=ex4o, family=quasipoisson)

And then use the anova function to compare the 2 models:

anova(ex4o_r2, ex4o_new)

This is a "pretty standard" R regression thing, similar to what people do with all sorts of models in R. It is what the "drop1" function does for lm models, I believe. You may have seen regression models where an F test is done comparing a full model against a restricted model? This is following the same line of thought.

To test your question about the factor levels, here is what you should do. Suppose the initial factor has 5 levels, and you wonder "do I really need 5 levels, or can I drop out the separate estimations for 3 of the levels?" Create a new factor with the simpler structure, run it through the model in place of the original factor, and run anova to compare the 2 models. I wrote down some of those ideas in a paper last spring (http://pj.freefaculty.org/Papers/MidWest09/Midwest09.pdf), but when I was done it seemed so obvious to me (& my friends) that I did not try to publish it.

With anova, there is a test= option where you can specify whether you want a chisq or F test.

And, for future reference, when the experts ask you for a data example to work on, they do not mean a copy of your printout, although that may help. What they want is the actual data and commands that you use.
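The factor-collapsing step can be sketched in code. A toy demonstration, assuming a hypothetical factor with five levels "a" through "e" (the level names and the choice of which three to merge are invented for illustration; Raldo's real data is not available here):

```r
# Collapse three of five levels of a factor into a single level "cde",
# so the model estimates one effect for them instead of three.
G  <- factor(c("a", "b", "c", "d", "e", "c"))
G2 <- factor(ifelse(G %in% c("c", "d", "e"), "cde", as.character(G)))
levels(G2)   # "a" "b" "cde"

# With the real data you would then refit with the recoded factor and
# compare against the full model, along the lines of:
#   ex4o$G2  <- factor(ifelse(ex4o$G %in% c("c", "d", "e"), "cde",
#                             as.character(ex4o$G)))
#   ex4o_new <- glmer(Counts ~ N + G2 + Year + G2:Year + N:G2:Year + (Year|Site),
#                     data = ex4o, family = quasipoisson)
#   anova(ex4o_r2, ex4o_new)
```

A significant anova result would say the collapsed factor loses explanatory power, i.e. the separate levels earn their keep.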
Usually, you have to upload the data somewhere for us to see, along with the code. Good luck with your project. pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas
-- Raldo