Nested error term and unbalanced design
5 messages · Erica Newman, Ben Bolker, Baldwin, Jim -FS
While there is a definite order to family, genus, and species (no pun intended), I think that the "nestedness" (if any) would be related to how you selected your sampling units rather than the fixed effects of family, genus, and species. (I admit bias in rarely if ever considering species as a random effect.)
Jim
-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Erica Newman
Sent: Saturday, February 23, 2013 2:21 PM
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] Nested error term and unbalanced design
I am trying to run a model that incorporates both environmental variables and taxonomic relationships, and I am unsure if I am 1) specifying the error term correctly, and 2) accounting for unbalanced data correctly. I would appreciate any guidance you can provide.
As a simplified example, I want to ask if a bird is more likely to be carrying ticks based on the habitat it was caught in, and what kind of bird it is (my actual model has many more environmental variables). We have many related species in multiple genera in multiple families, but all in the same order. Species is nested within genus, and genus is nested within family. I want to estimate a fixed effect for both habitat and species, while accounting for the nestedness of the relationships of the birds, and I also want to account for the fact that we caught more of certain species than others.
My simplified model looks like this:
M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS/SPECIES), family=binomial(link="logit"))
where y is a column vector of (tick presence, tick absence)
So my questions are: is this the correct "grammar" for the nested error? and does the nested error structure by itself take into account the unbalanced data structure?
Thank you in advance for your time.
Sincerely,
Erica Newman
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
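The "grammar" question in the original post can be checked without fitting anything. The sketch below uses only base R's formula algebra to expand the nesting operator; lme4 does its own expansion of `/` inside random-effect terms, but it follows the same convention, so `(1|FAMILY/GENUS/SPECIES)` is read as `(1|FAMILY) + (1|FAMILY:GENUS) + (1|FAMILY:GENUS:SPECIES)`. The helper names (`nested`, `fixed`, `innermost`, `overlap`) are made up for illustration.

```r
# Expand the nesting operator with R's formula algebra (base R only).
nested <- attr(terms(~ FAMILY/GENUS/SPECIES), "term.labels")
print(nested)

# Fixed-effect terms of the model as originally posted.
fixed <- attr(terms(~ HABITAT + SPECIES), "term.labels")

# Innermost variable of each grouping term: the level at which that term varies.
innermost <- vapply(strsplit(nested, ":", fixed = TRUE),
                    function(x) x[length(x)], character(1))

# Variables that show up both as a fixed effect and as a grouping level.
overlap <- intersect(fixed, innermost)
print(overlap)
```

The innermost term varies at the species level, so in the posted formula SPECIES enters both as a fixed effect and as a random grouping factor.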
1 day later
Baldwin, Jim -FS <jbaldwin at ...> writes:
While there is a definite order to family, genus, and species (no pun intended), I think that the "nestedness" (if any) would be related to how you selected your sampling units rather than the fixed effects of family, genus, and species. (I admit bias in rarely if ever considering species as a random effect.)
Jim
I think I respectfully disagree ... see below ...
I am trying to run a model that incorporates both environmental variables and taxonomic relationships, and I am unsure if I am 1) specifying the error term correctly, and 2) accounting for unbalanced data correctly. I would appreciate any guidance you can provide.
As a simplified example, I want to ask if a bird is more likely to be carrying ticks based on the habitat it was caught in, and what kind of bird it is (my actual model has many more environmental variables). We have many related species in multiple genera in multiple families, but all in the same order. Species is nested within genus, and genus is nested within family. I want to estimate a fixed effect for both habitat and species, while accounting for the nestedness of the relationships of the birds, and I also want to account for the fact that we caught more of certain species than others.
My simplified model looks like this: M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS/SPECIES), family=binomial(link="logit")) where y is a column vector of (tick presence, tick absence) So my questions are: is this the correct "grammar" for the nested error? and does the nested error structure by itself take into account the unbalanced data structure?
Generally you don't have to worry about lack of balance in
'modern' mixed models unless it's really extreme.
I'm having a little bit of a hard time conceptually with the
idea of having species as a fixed effect _and_ having the
variances of family and genus be random. You certainly
shouldn't have a categorical predictor (SPECIES) appear as both
a random and a fixed effect, though.
M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS),
family=binomial(link="logit"))
*might* work (I would give it a try and see if the results are sensible).
I would also consider
M1 <- lmer(y ~ HABITAT + (HABITAT|FAMILY/GENUS/SPECIES),
family=binomial(link="logit"))
if your data set is big enough to support it. This allows for habitat
to have different effects on different species ... (see a paper
by Schielzeth and Forstmeier on the importance of including interactions
between fixed and random effects:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2657178/ )
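For readers on current lme4 releases, where binomial models are fitted with glmer() rather than lmer(..., family=), here is a minimal sketch of a random-intercept version of the nested model on simulated data. Everything about the data is an assumption for illustration: the data frame `dat`, the columns `ticks`, `no_ticks`, and `n_birds`, the size of the taxonomy, and the simulated effect sizes do not come from the thread.

```r
library(lme4)  # assumed to be installed; provides glmer() and ngrps()

set.seed(1)
# Hypothetical taxonomy: 3 families x 2 genera x 3 species = 18 species.
taxa <- expand.grid(sp = 1:3, gen = 1:2, fam = 1:3)
taxa$FAMILY  <- factor(paste0("F", taxa$fam))
taxa$GENUS   <- factor(paste0("G", taxa$fam, "_", taxa$gen))
taxa$SPECIES <- factor(paste0("S", taxa$fam, "_", taxa$gen, "_", taxa$sp))

# One row per species x habitat (merge with no shared columns is a cross join),
# with deliberately unequal numbers of birds caught.
dat <- merge(taxa[, c("FAMILY", "GENUS", "SPECIES")],
             data.frame(HABITAT = factor(c("forest", "field"))))
dat$n_birds <- sample(5:60, nrow(dat), replace = TRUE)

# Simulated species-level deviations on the logit scale.
sp_eff <- rnorm(nlevels(dat$SPECIES), sd = 0.7)
eta <- -1 + 0.8 * (dat$HABITAT == "forest") + sp_eff[as.integer(dat$SPECIES)]
dat$ticks    <- rbinom(nrow(dat), dat$n_birds, plogis(eta))
dat$no_ticks <- dat$n_birds - dat$ticks

# Two-column response: (birds with ticks, birds without ticks).
m_nested <- glmer(cbind(ticks, no_ticks) ~ HABITAT + (1 | FAMILY/GENUS/SPECIES),
                  family = binomial(link = "logit"), data = dat)
print(ngrps(m_nested))  # number of levels per grouping factor

# Random-slope variant from the message above, for larger data sets:
#   cbind(ticks, no_ticks) ~ HABITAT + (HABITAT | FAMILY/GENUS/SPECIES)
```

With so few families and genera, a "boundary (singular) fit" message is likely here; that reflects the toy data, not the syntax.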
I think someone wise said "When you find yourself in a hole, first put down the shovel." Someday I'll learn that. (Maybe today.) What follows is likely from my lack of biological (and maybe statistical) knowledge.
The setup seems to be that individual birds (classified as to their species and habitat) are checked for the presence of ticks. For each species and habitat combination there is a proportion of birds with ticks. Each species is also classified as to genus and family. It is of interest to see if there are differences among genus and family classifications.
I see everything as a fixed effect in this case. I see no random effects or a relevant variance component as I can't imagine that for any genus and family that there is actually a random sample from all species within that family (especially if there are only a small number of species within a particular family to select from).
If a family (either within a habitat type or across habitat types) is to be compared to another family, it would seem that the first comparison would be among the mean of the species proportions (or maybe the mean of the logits or probits) for each family. Next it is conceivable that one might want to know if the variability of the species within a family varies among families. That could be done by defining/declaring the summary statistic of interest to be the variance of the "true" proportions within a family and one would use the sample data to estimate those variances. But these variances would be as summary statistics rather than a variance component essential to the definition of the model. The underlying model would simply be the number of birds with ticks following a binomial distribution with the proportion of birds with ticks being a function of species and habitat.
I agree with the article you mentioned concerning the use of random coefficient models. I just don't see treating species as a randomly selected subject from a family of species.
(Maybe treating insect species as a randomly selected species within a family where there are zillions of species but not for critters much higher up the food chain.)
Jim
On 13-02-25 12:55 PM, Baldwin, Jim -FS wrote:
I think someone wise said "When you find yourself in a hole, first put down the shovel." Someday I'll learn that. (Maybe today.) What follows is likely from my lack of biological (and maybe statistical) knowledge. The setup seems to be that individual birds (classified as to their species and habitat) are checked for the presence of ticks. For each species and habitat combination there is a proportion of birds with ticks. Each species is also classified as to genus and family. It is of interest to see if there are differences among genus and family classifications. I see everything as a fixed effect in this case. I see no random effects or a relevant variance component as I can't imagine that for any genus and family that there is actually a random sample from all species within that family (especially if there are only a small number of species within a particular family to select from).
I have a different definition of random effects, more along the pragmatic/Bayesian than the philosophical/frequentist (this is discussed at more length at http://glmm.wikidot.com/faq ). In essence, I make the distinction between fixed and random effects more on the criteria
* is it useful to estimate these parameters with shrinkage? (yes=random) and
* would I rather have the ability to extrapolate to unmeasured units/make inferences about the variation among units (random) or to make inferential statements about differences between particular sets of units (fixed)?
I do *not* make much use of the experimental-design criterion (were these units selected randomly, or could they have been selected randomly, from a larger set of values)? So I see no problem in treating family/genus/species as random effects. Opinions differ, though.
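To make "estimating with shrinkage" concrete, here is a small sketch of partial pooling in a normal random-intercept model with known variances. This is a deliberate simplification of the binomial GLMM under discussion, and the function name `shrink_weight`, the sample sizes, and both variances are hypothetical. In that simplified model, the estimate for a group is pulled from its own mean toward the overall mean with weight tau2 / (tau2 + sigma2 / n) on the group's own mean.

```r
# Weight placed on a group's own mean; the remainder goes to the overall mean.
# tau2   = between-group variance, sigma2 = within-group variance (both assumed known)
shrink_weight <- function(n, tau2, sigma2) {
  tau2 / (tau2 + sigma2 / n)
}

# Hypothetical, strongly unbalanced sample sizes for three species.
n_birds <- c(common = 200, moderate = 20, rare = 2)
w <- shrink_weight(n_birds, tau2 = 0.5, sigma2 = 4)
print(round(w, 3))

# Same raw deviation from the overall mean for every species ...
group_mean <- c(common = 1, moderate = 1, rare = 1)
overall <- 0
# ... but the poorly sampled species is pulled much closer to the overall mean.
shrunk <- overall + w * (group_mean - overall)
print(round(shrunk, 3))
```

This is also one way to see why moderate imbalance is not a problem for such models: species with few birds simply borrow more from the rest of the data.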
If a family (either within a habitat type or across habitat types) is to be compared to another family, it would seem that the first comparison would be among the mean of the species proportions (or maybe the mean of the logits or probits) for each family). Next it is conceivable that one might want to know if the variability of the species within a family varies among families. That could be done by defining/declaring the summary statistic of interest to be the variance of the "true" proportions within a family and one would use the sample data to estimate those variances. But these variances would be as summary statistics rather than a variance component essential to the definition of the model. The underlying model would simply be the number of birds with ticks following a binomial distribution with the proportion of birds with ticks being a function of species and habitat.
This is a sensible question, but hard to set up within lme4. The random effects coded in lme4 (and in most GLMMs) quantify whether the mean (on the link scale = logit/probit/etc.) differs among units, not whether the variation differs. You could do this in AD Model Builder/WinBUGS/Stan/etc. (I think this has been discussed before on the list.)
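The descriptive alternative Jim outlines (family means and within-family variances of species-level logits, treated as summary statistics) can be computed outside any mixed model. The sketch below uses base R on made-up counts; the data frame `birds`, its column names, and the 0.5 continuity correction in the empirical logit are assumptions for illustration.

```r
# Hypothetical counts per species (pooled over habitats for brevity).
birds <- data.frame(
  FAMILY  = c("F1", "F1", "F1", "F2", "F2", "F2"),
  SPECIES = c("s1", "s2", "s3", "s4", "s5", "s6"),
  ticks   = c(12, 30, 5, 8, 9, 10),
  total   = c(40, 60, 50, 30, 33, 35)
)

# Empirical logit; the 0.5 correction keeps it finite at 0% or 100% prevalence.
birds$elogit <- with(birds, log((ticks + 0.5) / (total - ticks + 0.5)))

# Per-family mean and variance of the species-level logits.
fam_mean <- tapply(birds$elogit, birds$FAMILY, mean)
fam_var  <- tapply(birds$elogit, birds$FAMILY, var)
print(cbind(mean_logit = fam_mean, var_logit = fam_var))
```

These raw variances mix true among-species variation with binomial sampling noise and ignore unequal sample sizes, which is part of what a model-based approach in AD Model Builder, WinBUGS, or Stan would try to separate.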