lme vs. lmer

3 messages · Raldo Kruger, Paul Johnson

#
On Wed, Sep 30, 2009 at 2:40 PM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
Dear Raldo:

Finally there is a question that I can help with!  It appears to me
you don't have much experience with regression modeling in R and the
other people who answer you are talking a bit "over your head".

In your case, call this fit the "full" or "unrestricted" model:

ex4o_r2 <- glmer(Counts ~ N + G + Year + N:Year + G:Year + N:G:Year + (Year | Site),
                 data = ex4o, family = quasipoisson)

(I'm not commenting on the specification itself.)

Suppose you wonder "Should I leave out the multiplicative effect
between N and Year?"  Then you fit the model that excludes that one
term:

ex4o_new <- glmer(Counts ~ N + G + Year + G:Year + N:G:Year + (Year | Site),
                  data = ex4o, family = quasipoisson)

And then use the anova function to compare the two models:

anova(ex4o_r2, ex4o_new)

This is a "pretty standard" R regression thing, similar to what people
do with all sorts of models in R.  It is what the "drop1" function
does for lm models, I believe.
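To make that concrete, here is a minimal sketch of drop1() on an
ordinary lm fit, using the built-in mtcars data (purely illustrative,
not the poster's glmer model):

```r
# drop1() refits the model with each term removed in turn and reports
# a test for the loss of fit -- the same full-vs-restricted logic as
# the manual anova() comparison above.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
drop1(fit, test = "F")
```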

You may have seen regression models where an F test is done comparing
a full model against a restricted model?  This is following the same
line of thought.

To test your question about the factor levels, here is what you should
do. Suppose the initial factor has 5 levels, and you wonder "do I
really need 5 levels, or can I drop out the separate estimations for 3
of the levels?"  Create a new factor with the simpler structure, run
it through the model in place of the original factor, and run anova to
compare the 2 models.  I wrote down some of those ideas in a paper
last spring (http://pj.freefaculty.org/Papers/MidWest09/Midwest09.pdf),
but when I was done it seemed so obvious to me (& my friends) that I
did not try to publish it.
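A minimal sketch of that level-collapsing test, using the built-in
InsectSprays data (a 6-level factor) in place of the poster's factor;
the particular grouping chosen here is arbitrary, purely for
illustration:

```r
# Full model: all 6 levels of the spray factor estimated separately.
full <- lm(count ~ spray, data = InsectSprays)

# Restricted model: collapse the 6 levels into 2 coarser groups.
InsectSprays$spray2 <- factor(ifelse(InsectSprays$spray %in% c("C", "D", "E"),
                                     "low", "high"))
restricted <- lm(count ~ spray2, data = InsectSprays)

# If this test is significant, the collapsed factor loses real
# explanatory power and the full set of levels should be kept.
anova(restricted, full)
```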

With anova, there is a test= option where you can specify whether you
want a chi-squared or an F test.
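A sketch of that test= argument, shown with plain glm() Poisson fits
on the built-in warpbreaks data (the anova method for glmer fits has a
different interface, so this only illustrates the glm case):

```r
# Nested Poisson GLMs: with and without the wool:tension interaction.
full       <- glm(breaks ~ wool * tension, data = warpbreaks, family = poisson)
restricted <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Likelihood-ratio chi-squared test for the dropped interaction.
anova(restricted, full, test = "Chisq")

# For quasi families, an F test is the usual choice:
# anova(restricted, full, test = "F")
```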

And, for future reference, when the experts ask you for a data example
to work on, they do not mean a copy of your printout, although that
may help.  What they want is the actual data and commands that you
use.  Usually, you have to upload the data somewhere for us to see,
along with the code.

Good luck with your project.

pj
#
On Wed, Oct 7, 2009 at 6:03 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
No, that is backwards.  If the anova says the two models are
different, keep the full, unrestricted model. The significance of the
difference indicates you lost "explanatory power" when you removed a
variable or recoded to remove a parameter.


It's not a problem that "the program does not give you a p-value."
The problem is that the statistical theory is tenuous enough that
nobody is sure what the p-value ought to be.  The quasi model is based
on the idea that nobody knows the sampling distribution of the random
component very well, so the standard errors are, well, sorta
not-really-standard.  I think you'd probably want to calculate some
kind of robust standard error, but I've not done it with a
quasi-poisson model.

Instead of a quasi-poisson, you could look around for a mixed model
program that has a negative binomial option.  For the negative
binomial, we have a more exact model of the randomness and the
standard errors are better understood.  lmer does not have a negative
binomial family, but I'm pretty sure I've seen one somewhere.
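For the non-mixed case at least, the negative binomial is available in
the recommended MASS package.  A sketch (note this is glm.nb, a
fixed-effects fit on built-in data, not a replacement for the mixed
model above):

```r
# Negative binomial GLM (no random effects) via MASS::glm.nb, shown on
# the built-in warpbreaks data purely to illustrate the family.
library(MASS)

fit_nb <- glm.nb(breaks ~ wool + tension, data = warpbreaks)

# theta is the estimated dispersion parameter; the standard errors here
# come from a fully specified likelihood, unlike the quasi-poisson case.
summary(fit_nb)
```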

If you really believe you want a p-value, you can calculate one
yourself.  From the "summary" output you'll see there are columns of
b's and standard errors.  Divide away for yourself.
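The hand calculation looks like this; it is shown with glm() on
built-in data, but the same arithmetic applies to the Estimate and
Std. Error columns that summary() prints for a glmer fit:

```r
# Pull the coefficient table out of the summary and form z = b / se.
fit  <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
ctab <- coef(summary(fit))   # columns: Estimate, Std. Error, z value, Pr(>|z|)

z <- ctab[, "Estimate"] / ctab[, "Std. Error"]
p <- 2 * pnorm(-abs(z))      # two-sided p-value against a normal reference

cbind(b = ctab[, "Estimate"], se = ctab[, "Std. Error"], z = z, p = p)
```

Whether the normal reference distribution is really appropriate for a
quasi-poisson mixed model is exactly the tenuous-theory point above.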

The problem of deciding which variable to drop is a hard one; it is
the same in ordinary regression.

I'm trained in a tradition that says you should try to choose
variables by theory, and don't commit the sin of dropping variables
just because they are "not significant".  If there is any
multicollinearity, the dropping process may lead to mistaken
conclusions.  This is the flaw in so-called "stepwise" regression. You
are a "bonehead" if you let the model tell you which parameters to
include.  Models will lie to you.  Model pruning of that sort--the
search for "significant" estimates--produces bad t-tests and a lot of
silly articles getting published.  I've seen economists and political
scientists publish survey articles saying that just about
everything we publish is misleading/wrong because of the model pruning
approach.

I've wondered if we could not work out a "regression tree" or "forest"
framework to choose which variables are needed in your glm.  I read a
lot about it, but concluded it did not exactly fit my need.  If you
are looking for some rigorous justification to include/exclude
variables, I think you have to look in that more exotic direction.  I
saw a beautiful presentation about the LASSO that selects and
estimates and accounts for shrinkage as well.

There's an article by Ed Leamer from the AER with a title like "Let's
take the con out of econometrics."  It deals with the variable
selection problem.  I *think* his suggested approach would be the one we call
Bayesian Model Averaging today.   If you fit a lot of models, drop
variables in and out, then the final result should somehow summarize
the variety of estimates you observed.

Sorry, this is preaching in the wrong context. You don't really have a
r-sig-mixed problem here, you have a more general (somewhat religious)
question about regression modeling.

pj