
lmer and p-values

18 messages · Iker Vaquero Alba, Ben Bolker, John Maindonald +3 more

#
Iker Vaquero Alba <karraspito at ...> writes:
When you do anova() in this context you are doing a likelihood ratio
test, which is equivalent to doing an F test with 1 numerator df and
a very large (infinite) denominator df.  
  As Pinheiro and Bates 2000 point out, this is dangerous/anticonservative
if your data set is small, for some value of "small".
   Guessing an appropriate denominator df, or using mcmcsamp(), or parametric
bootstrapping, or something, will be necessary if you want a more
reliable p-value.
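The parametric-bootstrap idea is easy to sketch. Below is a toy Python illustration (in R one would simulate() new responses from the null lmer fit and refit both models); the plain normal-error regression, the sample size, and all names here are invented for the example, not taken from the thread.

```python
# Toy illustration of a parametric-bootstrap LRT p-value.  A plain
# normal-error regression stands in for the mixed model; in R one would
# simulate() from the null lmer fit and refit both models instead.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

def lrt_stat(y, x):
    """2*(logLik(full) - logLik(null)) for y ~ 1 versus y ~ 1 + x,
    via the profiled normal log-likelihood: n * log(RSS_null/RSS_full)."""
    n = len(y)
    rss_null = np.sum((y - y.mean()) ** 2)        # intercept-only model
    coef = np.polyfit(x, y, 1)                    # intercept + slope
    rss_full = np.sum((y - np.polyval(coef, x)) ** 2)
    return n * np.log(rss_null / rss_full)

n = 12                                            # deliberately small sample
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 0.8 * x + rng.normal(size=n)            # invented data
obs = lrt_stat(y, x)

p_chisq = chi2.sf(obs, df=1)                      # naive chi-square(1) p-value

# Parametric bootstrap: simulate responses under the fitted null model,
# recompute the statistic, and take the empirical tail probability.
mu, sd = y.mean(), y.std(ddof=1)
null_stats = np.array([lrt_stat(rng.normal(mu, sd, n), x)
                       for _ in range(2000)])
p_boot = (null_stats >= obs).mean()
```

The bootstrap p-value replaces the chi-square(1) approximation with the empirical null distribution of the statistic, which is exactly where the "infinite denominator df" optimism bites in small samples.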
#
On 03/28/2011 01:04 PM, Iker Vaquero Alba wrote:
Why are you simplifying the model in the first place?  (That is a real
question, with only a tinge of prescriptiveness.) Among the active
contributors to this list and other R lists, I would say that the most
widespread philosophy is that one should *not* do backwards elimination
of (apparently) superfluous/non-significant terms in the model.  (See
myriad posts by Frank Harrell and others.)

  If you do insist on eliminating terms, then the LRT (anova()) p-values
are no more or less reliable for the purposes of elimination than they
are for the purposes of hypothesis testing.
#
A slightly more accommodating position is that some selection 
may be acceptable if it makes little difference to the magnitudes of
parameter estimates and to the interpretations that can be placed
upon them.  [Since writing this, I notice that Ben has now posted a
message that makes broadly similar follow-up points.]

The usual interpretations of p-values assume, among other things, 
a known model.  This assumption is invalidated if there has been
some element of backward elimination or other element of variable
selection.  Following variable selection, the p-value is no longer, 
strictly, a valid p-value.

Elimination of a term with a p-value greater than, say, 0.15 or 0.2 is,
however, likely to make little difference to estimates of other terms
in the model.  Thus, it may be a reasonable way to proceed.  For
this purpose, even an anti-conservative (smaller than it should be)
p-value will usually suffice.

Nowadays it is of course relatively easy to do a simulation that will 
check the effect of a particular variable elimination/selection strategy.  
If there is some use of variable elimination/selection, and anything of 
consequence hangs on the results, this should surely be standard 
practice. 
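Such a check is easy to prototype. Here is a toy Python sketch (variable names and sizes are invented for the illustration): neither predictor has any real effect, yet keeping whichever one looks more significant inflates the apparent type I error rate above the nominal 5%.

```python
# Toy simulation of a selection strategy's effect on p-values: neither
# predictor has any real effect, but we keep whichever looks "more
# significant" and record its reported p-value.  All names are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 30, 2000
selected_p = []
for _ in range(reps):
    x1, x2, y = rng.normal(size=(3, n))   # pure noise: no true effects
    _, p1 = stats.pearsonr(x1, y)
    _, p2 = stats.pearsonr(x2, y)
    selected_p.append(min(p1, p2))        # keep the "better" predictor

# Without selection a true-null p-value rejects 5% of the time at 0.05;
# after selection the rate is close to 1 - 0.95**2, i.e. nearly 10%.
rate = np.mean(np.array(selected_p) < 0.05)
```

The same scheme extends to any elimination rule: simulate data in which the dropped terms are truly null, apply the rule, and see how the surviving p-values behave.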

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 29/03/2011, at 8:18 AM, Ben Bolker wrote:

#
On 03/28/2011 06:15 PM, John Maindonald wrote:

Note that naive likelihood ratio tests of random effects are likely to
be conservative (in the simplest case, true p-values are twice the
nominal value) because of boundary issues and those of fixed effects are
probably anticonservative because of finite-size effects (see Pinheiro
and Bates 2000 for examples of both cases).
Ben
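The factor-of-two correction for the random-effects case is simple to apply by hand. A hypothetical Python snippet (the statistic value is made up): for a test of a single variance component on its boundary, the null distribution of the LRT statistic is a 50:50 mixture of a point mass at zero and chi-square(1), so the naive chi-square(1) p-value is halved.

```python
# Boundary correction for testing one variance component against zero:
# the null distribution of the LRT statistic is a 50:50 mixture of a
# point mass at 0 and chi-square(1), so the naive chi-square(1) p-value
# is halved.  The statistic value below is made up for illustration.
from scipy.stats import chi2

lrt = 2.71
p_naive = chi2.sf(lrt, df=1)     # roughly 0.10: conservative
p_mixture = 0.5 * p_naive        # roughly 0.05: boundary-corrected
```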
#
On 11-03-29 07:35 AM, Manuel Spínola wrote:
Hmm.  What's the motivation for your question?

  The p-value gives you the probability of the observed pattern, or a
more extreme one, having occurred if the null hypothesis were true.
  The effect size (defined in various ways) tells you something about
the strength of the observed pattern.
   Statistical and subject-area (in your case, biological) significance
are complementary. A highly statistically significant but biologically
trivial effect is a curiosity; a biologically important but
statistically insignificant effect means you need more/better data.

  I don't know if that answers your question.
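The definition in the first sentence can be made concrete with a small simulation. In this toy Python sketch (sample size, effect, and seed are arbitrary choices), the two-sided one-sample t-test p-value is recovered, up to Monte Carlo error, as the fraction of null-simulated statistics at least as extreme as the observed one.

```python
# The p-value definition made concrete: the probability, under the null,
# of a statistic at least as extreme as the one observed.  Sample size,
# effect, and seed below are arbitrary choices for the illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.normal(0.5, 1.0, size=20)               # invented sample
t_obs, p_formula = stats.ttest_1samp(y, popmean=0.0)

# Monte Carlo version of "observed pattern, or a more extreme one":
# simulate from the null (mean 0) and count equally extreme statistics.
t_null = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, 20), 0.0)[0]
                   for _ in range(4000)])
p_sim = np.mean(np.abs(t_null) >= abs(t_obs))   # close to p_formula
```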
#
On Tue, Mar 29, 2011 at 8:45 AM, Manuel Spínola <mspinola10 at gmail.com> wrote:
This topic (and this web page) has been discussed at length on
this list recently. Check out the archives.

I like to think of p-values and hypothesis testing as a more scientific
variant of trial by jury, where the theory to be proved ("as charged")
is found guilty by establishing that the theories inconsistent with it
(the null hypotheses) are unlikely to be true given the observed data.
Conversely, if a null hypothesis cannot be ruled out "beyond a
reasonable doubt", then the theory to be tested "could not have been at
the scene of the crime." Note that, just as in a jury trial, this does
not prove that the theory in question is true with absolute certainty.

In practice one usually entertains several possible models or theories
and selects the one that seems to explain the data best by eliminating
most of the variance in the observations. More precisely, a good model
is one where the residual is negligible and looks like "noise."

Dominick
#
On 03/29/2011 04:44 PM, Manuel Spínola wrote:
A couple of points:

  * p-values certainly have their problems, but despite their problems
they answer a need.  Fisher/Neyman/Pearson were pretty smart guys, and
the question that p-values answer ("how likely is it that I would see a
pattern this strong, or stronger, if there were really nothing
happening?") is one that we often want to ask.  It's also nice to have a
concise, general statement of the strength of an effect, even if it has
flaws (arguably we could all be quoting log-likelihood differences, or
standardized regression coefficients, instead).
  * Notice how often the quotes that you posted below say "overuse", or
"undue", or "too much emphasis" (rather than "never" or "forbidden").
Yes, if I had to choose between a p-value and a confidence interval I
would take the confidence interval every time -- but then I have to
decide what kind of confidence interval I want, and if I decide to use
frequentist confidence intervals I am back in the soup again, both with
interpretation and with the difficulties (in the mixed model context) of
computing them appropriately.
  * I wouldn't object if everyone decided to go Bayesian, but that does
have its own can of worms (deciding on priors, computational issues
[e.g. judging convergence if using MCMC], etc.).  Again, if I had to choose
between frequentist *only* or Bayesian *only* I would probably choose
Bayesian. The hybrid-Bayesian approaches (e.g. mcmcsamp, post-estimation
MCMC in AD Model Builder) choose flat priors on the (perhaps arbitrarily
chosen) current scale of the parameters, glossing over details that are
sometimes important.  (The same goes for the pseudo-Bayesian
interpretation of AIC.)

  I agree that the relations among scientific theory and statistical
practices are tough. From Crome 1997:

18.  Use statistical procedures from a range of schools and strictly
adhere to their respective methods and interpretation. For example, do a
Fisherian significance test properly and interpret it properly. Then set
up a formal Neyman-Pearson test and interpret it formally (this means
setting up both Type I and II error rates beforehand, among other
things). Then do an estimation procedure. Then switch hats and do a
Bayesian analysis. Take the results of all four, noting their different
behavior, and come to your conclusion. Good analysis and interpretation
are as important as the fieldwork, so allot adequate time and resources
to both.  ....

Crome, Francis H. J. 1997. Researching tropical forest fragmentation:
Shall we keep on doing what we're doing? In Tropical forest remnants:
ecology, management, and conservation of fragmented communities, ed. W.
F Laurance and R. O Bierregard, 485-501. Chicago, IL: University of
Chicago Press.

  (There is more here that's worth reading.)
#
From: John Maindonald
I'm afraid that all too often the reason models are chosen on
"statistical grounds" is the lack of "scientific grounds".  Sort of
a catch-22, I guess...  Even when "scientific grounds" exist,
what exactly constitutes them, and how do we know they're not
another rabbit (or ozone) hole?

Andy
#
On Tue, Mar 29, 2011 at 7:44 PM, Liaw, Andy <andy_liaw at merck.com> wrote:
Yes, this is particularly so when studying social systems
or any rapidly evolving system (like the financial markets).
In this situation the statistical picture is often just a
snapshot that should probably be labeled (conditioned)
by the time of observation and the context.

In view of this complexity I'm tempted to view p-values
and hypothesis testing (when used in this context) as
a communication protocol that helps statisticians to
reach a consensus, and not as a tool that reveals
timeless truths.

Dominick