An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20120726/852f87e3/attachment.pl>
Specifying 'correct' degrees of freedom for within-subject factor in *nlme/lme* repeated measures ANOVA?
5 messages · Stephanie Avery-Gomm, Ben Bolker, Peter Dalgaard +1 more
1 day later
Stephanie Avery-Gomm <stephanie.averygomm at ...> writes:
Hello, I am using *nlme* to do a mixed effect repeated measures ANCOVA, with two additional fixed factors but a limited sample size. *I am seeking clarification on how to/if I should adjust the inflated degrees of freedom for a within-subject factor as a way of dealing with the temporal pseudoreplication. *I am not using lmer so I am not sure the FAQ or Bates discussion re: adjusting df in lme4/lmer applies ( https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html).* *More information: At 3 Sites on a river I measured fish Population in approximately 9 stream Channel Units. Each Channel Unit was classified as a Habitat, with three levels (Glide, Riffle, Pool). I sampled each Channel Unit 3 times over the course of the summer, each time taking a Discharge Measurement (thus the exact Discharge differs a little from Site to Site, and so is a continuous variable). I want to know if fish Population in stream habitats (Glides, Riffles, Pools) changes as discharge decreases over the summer, if there an interaction and if fish Populations differ between habitats or between sites? The model I have settled on looks like this: Pop.Model<-lme(Pop~Site+Habitat*Discharge, random=~1|ChannelUnit, correlation=corCAR1(),data=mydata) Inclusion of the three repeated measurements of Population in each Channel Unit results in temporal pseudoreplication *and the degrees of freedom for the within-subjects factor (Discharge) is 42, but I only have 26 Channel Units, so this is obviously inflated (should be 21). I read in The R Book (Crawley: *(Pg. 644*) that I can fix this by specifying the degrees of freedom. But how?* * Although I?ve read a ton online, including Bates info re: SAS PROC Mixed versus R lmer and degrees of freedom I find that I am still quite confused. If anyone can offer specific advice on how I can adjust my degrees of freedom for the within-subjects factor in nlme or explain in accessible terms why I don?t need to, I would be very grateful. * Just in case I haven't provided enough information, here is my data and r code. .csv file: https://www.dropbox.com/s/2ijgq74di3hmo8i/R.Help.csv .R file: https://www.dropbox.com/s/puj5maifxc2rfcg/R%20Help.R .doc with code & diagram: https://www.dropbox.com/s/29dtofc62t957co/R%20Help.doc Sincerely, Stephanie Avery-Gomm MSc. Candidate, Zoology Department University of British Columbia
Are you sure that 42 (which is a propitious number in any case, see _The Hitchhiker's Guide to the Galaxy_) is *not* the right number of df for Discharge? Continuous predictors often behave differently from discrete ones: in particular, see the discussion at http://tinyurl.com/ntygq3 (referenced from http://glmm.wikidot.com/faq) about how lme computes degrees of freedom: "a term is _outer_ to a grouping factor if its value does not change within levels of the grouping factor", thus if Discharge takes on different values within each Channel Unit then it is estimated at the innermost level. Crawley doesn't actually say (AFAICT) that you ought to be manually adjusting the df provided by lme: "You use all of the data in the model, and you specify its structure appropriately so that the hypotheses are tested with the correct degrees of freedom (10 in this case, not 48)". For the case he is examining, he is using an interaction between the continuous predictor (week) and the grouping factor (plant), *and* the weeks measured are the same for each plant. I won't say that lme *always* gets the df 'right', but I don't think I've ever seen a case where there was an unambiguous right answer (i.e. the situation matched a classical experimental design so that the problem could also be expressed as a standard method-of-moments ANOVA with a well defined denominator df) *and* lme got it wrong. I would suggest: (a) trying out a variety of examples (cross {discrete predictors, continuous predictors with identical values within each group, continuous with different values in each group} with {random intercept only, random intercept + random slope}); (b) looking in an alternative source such as Ellison and Gotelli's _Primer of Ecological Statistics_ to try to convince yourself about the appropriate df. Two more issues: * if the qualitative and quantitative structure of your data allow it, you should consider adding interactions of Discharge with random (Channel Unit) and fixed effects (Site) in your model (see Schielzeth and Forstmeier 2009). * Another minor can of worms is that one might consider adjusting the 'denominator df' for the autoregressive structure -- if the points are not all independent, then the effective df will be slightly smaller. In principle one can do this with Satterthwaite or Kenward-Roger approximations, but I don't know if anyone's implemented them for lme models (pbkrtest implements them for lme4 models, but those don't allow temporal autocorrelation structures. Have you looked at the ACF() output to see if the temporal correlation structure is really necessary for your data?) However, I would be tempted to sweep this under the rug (as Crawley seems to; he doesn't mention df again when discussing autocorrelation structures). (I will also point out that is is **not** kosher in my opinion to post a public link to the entirety of a copyrighted (and non-open) work; it would be fair use, I think, to post a copy of a relevant page or two, or to point to it on Google Books <http://books.google.com/books?id=8D4HVx0apZQC&pg=PA644>.) @article{schielzeth_conclusions_2009, title = {Conclusions beyond support: overconfident estimates in mixed models}, volume = {20}, number = {2}, journal = {Behavioral Ecology}, author = {Schielzeth, Holger and Forstmeier, Wolfgang}, month = mar, year = {2009}, issn = {1045-2249, 1465-7279}, shorttitle = {Conclusions beyond support}, url = {http://beheco.oxfordjournals.org/content/20/2/416}, doi = {10.1093/beheco/arn145}, pages = {416--420}, }
On Jul 28, 2012, at 00:16 , Ben Bolker wrote:
I won't say that lme *always* gets the df 'right', but I don't think I've ever seen a case where there was an unambiguous right answer (i.e. the situation matched a classical experimental design so that the problem could also be expressed as a standard method-of-moments ANOVA with a well defined denominator df) *and* lme got it wrong.
Haven't played with lme for a while, but models with crossed random effects is a pretty sure way to get df wrong. E.g., this one which I dug out from some 2006 slides:
library(nlme) summary(aov(logDens~sample*dilut+Error(Block/(sample*dilut)), data=Assay))
Error: Block
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1 0.008311 0.008311
Error: Block:sample
Df Sum Sq Mean Sq F value Pr(>F)
sample 5 0.27615 0.05523 11.21 0.00952 **
Residuals 5 0.02463 0.00493
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Error: Block:dilut
Df Sum Sq Mean Sq F value Pr(>F)
dilut 4 3.749 0.9373 420.8 1.68e-05 ***
Residuals 4 0.009 0.0022
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Error: Block:sample:dilut
Df Sum Sq Mean Sq F value Pr(>F)
sample:dilut 20 0.05552 0.002776 1.607 0.149
Residuals 20 0.03455 0.001728
This is not "officially" supported by lme, but you can cheat it to fit the model:
as3 <- lme(logDens~sample*dilut, data=Assay,
+ random=list(Block=~1, + Block=pdIdent(~sample-1), + dilut=~1))
anova(as3)
numDF denDF F-value p-value (Intercept) 1 25 538.0174 <.0001 sample 5 25 11.2133 <.0001 dilut 4 4 420.7911 <.0001 sample:dilut 20 25 1.6069 0.1301 but as you see, the F-values are right but the denDF are not.
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
Hello - The proper degrees of freedom will depend on what error you specify. I'd expect there is a between Channels component of error that affects the slope estimates. In that case your error term maybe should be random=~Discharge|ChannelUnit One can of course check whether the estimate of the relevant component of variance is greater than 0, and if so whether it is of any consequence. In such analyses, it is commonly assumed that the variation between slopes can be entirely explained by variation about a line whose clop is constant (here, across ChannelUnits). Experience with previous such data may establish whether this is a reasonable assumption. A cautious analyst will, if the data allow it, want to check this assumption. The degrees of freedom are of consequence only when you want to move from F-statistics or t-statistics or other such statistics to p-values. They are part of a mechanism that is used to get approximate p-values. The p-values are, in general, not well-defined -- assumptions are made along the lines of the assumptions that underpin the Behrens-Fisher test. If the model is balanced they can in general, most pundits will I think argue, be used as a reasonable guide. If the model is badly unbalanced, they should be treated with caution. John Maindonald email: john.maindonald at anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. http://www.maths.anu.edu.au/~johnm
On 28/07/2012, at 8:16 AM, Ben Bolker wrote:
Stephanie Avery-Gomm <stephanie.averygomm at ...> writes:
Hello, I am using *nlme* to do a mixed effect repeated measures ANCOVA, with two additional fixed factors but a limited sample size. *I am seeking clarification on how to/if I should adjust the inflated degrees of freedom for a within-subject factor as a way of dealing with the temporal pseudoreplication. *I am not using lmer so I am not sure the FAQ or Bates discussion re: adjusting df in lme4/lmer applies ( https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html).* *More information: At 3 Sites on a river I measured fish Population in approximately 9 stream Channel Units. Each Channel Unit was classified as a Habitat, with three levels (Glide, Riffle, Pool). I sampled each Channel Unit 3 times over the course of the summer, each time taking a Discharge Measurement (thus the exact Discharge differs a little from Site to Site, and so is a continuous variable). I want to know if fish Population in stream habitats (Glides, Riffles, Pools) changes as discharge decreases over the summer, if there an interaction and if fish Populations differ between habitats or between sites? The model I have settled on looks like this: Pop.Model<-lme(Pop~Site+Habitat*Discharge, random=~1|ChannelUnit, correlation=corCAR1(),data=mydata) Inclusion of the three repeated measurements of Population in each Channel Unit results in temporal pseudoreplication *and the degrees of freedom for the within-subjects factor (Discharge) is 42, but I only have 26 Channel Units, so this is obviously inflated (should be 21). I read in The R Book (Crawley: *(Pg. 644*) that I can fix this by specifying the degrees of freedom. But how?* * Although I?ve read a ton online, including Bates info re: SAS PROC Mixed versus R lmer and degrees of freedom I find that I am still quite confused. If anyone can offer specific advice on how I can adjust my degrees of freedom for the within-subjects factor in nlme or explain in accessible terms why I don?t need to, I would be very grateful. * Just in case I haven't provided enough information, here is my data and r code. .csv file: https://www.dropbox.com/s/2ijgq74di3hmo8i/R.Help.csv .R file: https://www.dropbox.com/s/puj5maifxc2rfcg/R%20Help.R .doc with code & diagram: https://www.dropbox.com/s/29dtofc62t957co/R%20Help.doc Sincerely, Stephanie Avery-Gomm MSc. Candidate, Zoology Department University of British Columbia
Are you sure that 42 (which is a propitious number in any case, see _The Hitchhiker's Guide to the Galaxy_) is *not* the right number of df for Discharge? Continuous predictors often behave differently from discrete ones: in particular, see the discussion at http://tinyurl.com/ntygq3 (referenced from http://glmm.wikidot.com/faq) about how lme computes degrees of freedom: "a term is _outer_ to a grouping factor if its value does not change within levels of the grouping factor", thus if Discharge takes on different values within each Channel Unit then it is estimated at the innermost level. Crawley doesn't actually say (AFAICT) that you ought to be manually adjusting the df provided by lme: "You use all of the data in the model, and you specify its structure appropriately so that the hypotheses are tested with the correct degrees of freedom (10 in this case, not 48)". For the case he is examining, he is using an interaction between the continuous predictor (week) and the grouping factor (plant), *and* the weeks measured are the same for each plant. I won't say that lme *always* gets the df 'right', but I don't think I've ever seen a case where there was an unambiguous right answer (i.e. the situation matched a classical experimental design so that the problem could also be expressed as a standard method-of-moments ANOVA with a well defined denominator df) *and* lme got it wrong. I would suggest: (a) trying out a variety of examples (cross {discrete predictors, continuous predictors with identical values within each group, continuous with different values in each group} with {random intercept only, random intercept + random slope}); (b) looking in an alternative source such as Ellison and Gotelli's _Primer of Ecological Statistics_ to try to convince yourself about the appropriate df. Two more issues: * if the qualitative and quantitative structure of your data allow it, you should consider adding interactions of Discharge with random (Channel Unit) and fixed effects (Site) in your model (see Schielzeth and Forstmeier 2009). * Another minor can of worms is that one might consider adjusting the 'denominator df' for the autoregressive structure -- if the points are not all independent, then the effective df will be slightly smaller. In principle one can do this with Satterthwaite or Kenward-Roger approximations, but I don't know if anyone's implemented them for lme models (pbkrtest implements them for lme4 models, but those don't allow temporal autocorrelation structures. Have you looked at the ACF() output to see if the temporal correlation structure is really necessary for your data?) However, I would be tempted to sweep this under the rug (as Crawley seems to; he doesn't mention df again when discussing autocorrelation structures). (I will also point out that is is **not** kosher in my opinion to post a public link to the entirety of a copyrighted (and non-open) work; it would be fair use, I think, to post a copy of a relevant page or two, or to point to it on Google Books <http://books.google.com/books?id=8D4HVx0apZQC&pg=PA644>.) @article{schielzeth_conclusions_2009, title = {Conclusions beyond support: overconfident estimates in mixed models}, volume = {20}, number = {2}, journal = {Behavioral Ecology}, author = {Schielzeth, Holger and Forstmeier, Wolfgang}, month = mar, year = {2009}, issn = {1045-2249, 1465-7279}, shorttitle = {Conclusions beyond support}, url = {http://beheco.oxfordjournals.org/content/20/2/416}, doi = {10.1093/beheco/arn145}, pages = {416--420}, }
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hello - I found it remarkably easy to dowload your data and reproduce your analysis. Your efforts at making what you'd done thus easy to reproduce are to be commended! Compare your
Pop.Model2<-lme(Pop~Site+Habitat*Discharge, random=~1|ChannelUnit,data=mydata) summary(Pop.Model2)
Linear mixed-effects model fit by REML
Data: mydata
AIC BIC logLik
640.1269 661.5583 -310.0635
Random effects:
Formula: ~1 | ChannelUnit
(Intercept) Residual
StdDev: 13.87812 29.8971
. . .
with:
Pop.ModelD2<-lme(Pop~Site+Habitat*Discharge, random=~Discharge|ChannelUnit,data=mydata) summary(Pop.Model2)
Linear mixed-effects model fit by REML
Data: mydata
AIC BIC logLik
636.0179 661.7355 -306.009
Random effects:
Formula: ~Discharge | ChannelUnit
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 17.88118 (Intr)
Discharge 450.69768 -0.663
Residual 23.51211
. . .
I conclude that there is a very large random slope effect, The df
for testing the fixed effect (slope) of Discharge should then be 21,
not 42 as the nlme output suggests.
John Maindonald email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473 fax : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 27/07/2012, at 5:55 AM, Stephanie Avery-Gomm wrote:
Hello, I am using *nlme* to do a mixed effect repeated measures ANCOVA, with two additional fixed factors but a limited sample size. *I am seeking clarification on how to/if I should adjust the inflated degrees of freedom for a within-subject factor as a way of dealing with the temporal pseudoreplication. *I am not using lmer so I am not sure the FAQ or Bates discussion re: adjusting df in lme4/lmer applies ( https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html).* *More information: At 3 Sites on a river I measured fish Population in approximately 9 stream Channel Units. Each Channel Unit was classified as a Habitat, with three levels (Glide, Riffle, Pool). I sampled each Channel Unit 3 times over the course of the summer, each time taking a Discharge Measurement (thus the exact Discharge differs a little from Site to Site, and so is a continuous variable). I want to know if fish Population in stream habitats (Glides, Riffles, Pools) changes as discharge decreases over the summer, if there an interaction and if fish Populations differ between habitats or between sites? The model I have settled on looks like this: Pop.Model<-lme(Pop~Site+Habitat*Discharge, random=~1|ChannelUnit, correlation=corCAR1(),data=mydata) Inclusion of the three repeated measurements of Population in each Channel Unit results in temporal pseudoreplication *and the degrees of freedom for the within-subjects factor (Discharge) is 42, but I only have 26 Channel Units, so this is obviously inflated (should be 21). I read in The R Book (Crawley: *(Pg. 644, https://www.dropbox.com/s/4zqewxl44btqmzo/Crawley%20The%20R%20book.pdf*) that I can fix this by specifying the degrees of freedom. But how?* * Although I?ve read a ton online, including Bates info re: SAS PROC Mixed versus R lmer and degrees of freedom I find that I am still quite confused. If anyone can offer specific advice on how I can adjust my degrees of freedom for the within-subjects factor in nlme or explain in accessible terms why I don?t need to, I would be very grateful. * Just in case I haven't provided enough information, here is my data and r code. .csv file: https://www.dropbox.com/s/2ijgq74di3hmo8i/R.Help.csv .R file: https://www.dropbox.com/s/puj5maifxc2rfcg/R%20Help.R .doc with code & diagram: https://www.dropbox.com/s/29dtofc62t957co/R%20Help.doc Sincerely, Stephanie Avery-Gomm MSc. Candidate, Zoology Department University of British Columbia -- -- Stephanie Avery-Gomm Master's Candidate Zoology Department, University of British Columbia #4200-6270 University Blvd. Vancouver, B.C. V6T 1Z4 Email: Stephanie.AveryGomm at gmail.com Cell: 778 322 3483 Web: http://ca.linkedin.com/in/stephanieaverygomm -- -- Stephanie Avery-Gomm Master's Candidate Zoology Department, University of British Columbia #4200-6270 University Blvd. Vancouver, B.C. V6T 1Z4 Email: Stephanie.AveryGomm at gmail.com Cell: 778 322 3483 Web: http://ca.linkedin.com/in/stephanieaverygomm [[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models