Nested error term and unbalanced design

While there is a definite order to family, genus, and species (no pun intended), I think that the "nestedness" (if any) would be related to how you selected your sampling units rather than the fixed effects of family, genus, and species.  (I admit bias in rarely if ever considering species as a random effect.)

Jim

-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Erica Newman
Sent: Saturday, February 23, 2013 2:21 PM
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] Nested error term and unbalanced design

I am trying to run a model that incorporates both environmental variables and taxonomic relationships, and I am unsure if I am 1) specifying the error term correctly, and 2) accounting for unbalanced data correctly. I would appreciate any guidance you can provide.

As a simplified example, I want to ask if a bird is more likely to be carrying ticks based on the habitat it was caught in, and what kind of bird it is (my actual model has many more environmental variables). We have many related species in multiple genera in multiple families, but all in the same order. Species is nested within genus, and genus is nested within family. I want to estimate a fixed effect for both habitat and species, while accounting for the nestedness of the relationships of the birds, and I also want to account for the fact that we caught more of certain species than others.

My simplified model looks like this:

M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS/SPECIES),
family=binomial(link="logit"))

where y is a column vector of (tick presence, tick absence)

So my questions are: is this the correct "grammar" for the nested error?
and does the nested error structure by itself take into account the unbalanced data structure?

Thank you in advance for your time.

Sincerely,

Erica Newman
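
On the "grammar" question above: in R's formula language the `/` operator is shorthand for nesting, and lme4 documents the same expansion for grouping factors, so `(1|FAMILY/GENUS/SPECIES)` is read as three separate random-intercept terms. A minimal base-R sketch of that expansion follows; it only inspects the formula, fits no model, and the variable names `nested`, `term_labels` and `re_terms` are chosen here for illustration.

```r
# Expand the nesting operator with base R's terms(); no data or lme4 needed
nested <- terms(~ FAMILY/GENUS/SPECIES)
term_labels <- attr(nested, "term.labels")
print(term_labels)

# The equivalent explicit random-intercept specification, built as a string
re_terms <- paste0("(1|", term_labels, ")", collapse = " + ")
print(re_terms)
```

Written out this way, it is easier to see that SPECIES enters the model above twice: once as a fixed effect and once inside the innermost random term.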

_______________________________________________
R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
Baldwin, Jim -FS <jbaldwin at ...> writes:
> While there is a definite order to family, genus, and species (no
> pun intended), I think that the "nestedness" (if any) would be
> related to how you selected your sampling units rather than the
> fixed effects of family, genus, and species.  (I admit bias in
> rarely if ever considering species as a random effect.)
> Jim

I think I respectfully disagree ... see below ...

> So my questions are: is this the correct "grammar" for the nested error?
> and does the nested error structure by itself take into account the
> unbalanced data structure?

Generally you don't have to worry about lack of balance in
'modern' mixed models unless it's really extreme.

  I'm having a little bit of a hard time conceptually with the
idea of having species as a fixed effect _and_ having the 
variances of family and genus be random.  You certainly
shouldn't have a categorical predictor (SPECIES) appear as both 
a random and a fixed effect, though.

M1 <- lmer(y ~ HABITAT + SPECIES + (1|FAMILY/GENUS),
     family=binomial(link="logit"))

*might* work (I would give it a try and see if the results are sensible).
I would also consider

M1 <- lmer(y ~ HABITAT + (HABITAT|FAMILY/GENUS/SPECIES),
     family=binomial(link="logit"))

if your data set is big enough to support it.  This allows for habitat
to have different effects on different species ... (see a paper
by Schielzeth and Forstmeier on the importance of including interactions
between fixed and random effects:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2657178/ )
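
A minimal sketch of the second suggestion, restricted to random intercepts and written for current lme4, where binomial models are fitted with `glmer()` rather than `lmer(..., family=)`. Everything about the data is an assumption made so the block runs on its own: the data frame `dat`, the taxonomy of 4 families x 3 genera x 3 species, the column names `ticks`, `clean` and `n`, the helper `re()`, and all simulated effect sizes.

```r
library(lme4)
set.seed(1)

# Hypothetical taxonomy: 4 families x 3 genera x 3 species, labels made unique
tax <- expand.grid(sp = 1:3, gen = 1:3, FAMILY = paste0("F", 1:4))
tax$GENUS   <- paste0(tax$FAMILY, "_G", tax$gen)
tax$SPECIES <- paste0(tax$GENUS, "_S", tax$sp)

# Every species observed in two habitats (no shared columns, so merge() crosses them)
dat <- merge(tax[c("FAMILY", "GENUS", "SPECIES")],
             data.frame(HABITAT = c("forest", "scrub")))

# Helper: draw one random intercept per level and map it back to the rows
re <- function(g, sd) {
  g <- factor(g)
  rnorm(nlevels(g), sd = sd)[as.integer(g)]
}

# Simulated linear predictor on the logit scale (all effect sizes invented)
eta <- -1 + 0.7 * (dat$HABITAT == "scrub") +
  re(dat$FAMILY, 0.8) + re(dat$GENUS, 0.5) + re(dat$SPECIES, 0.5)

dat$n     <- rpois(nrow(dat), 15) + 1            # unbalanced numbers of birds caught
dat$ticks <- rbinom(nrow(dat), dat$n, plogis(eta))
dat$clean <- dat$n - dat$ticks

# Two-column response (with ticks, without ticks); species is random only
M <- glmer(cbind(ticks, clean) ~ HABITAT + (1 | FAMILY/GENUS/SPECIES),
           family = binomial(link = "logit"), data = dat)

print(fixef(M))     # habitat effect on the logit scale
print(VarCorr(M))   # variance estimates for family, genus and species
```

The random-slope version in the message above would swap the random term for `(HABITAT | FAMILY/GENUS/SPECIES)`; it estimates more variance parameters and so needs correspondingly more data.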
I think someone wise said "When you find yourself in a hole, first put down the shovel."  Someday I'll learn that.  (Maybe today.)  What follows is likely from my lack of biological (and maybe statistical) knowledge.

The setup seems to be that individual birds (classified as to their species and habitat) are checked for the presence of ticks.  For each species and habitat combination there is a proportion of birds with ticks.  Each species is also classified as to genus and family.  It is of interest to see if there are differences among genus and family classifications.  I see everything as a fixed effect in this case.

I see no random effects or relevant variance component, as I can't imagine that for any genus and family there is actually a random sample from all species within that family (especially if there are only a small number of species within a particular family to select from).

If a family (either within a habitat type or across habitat types) is to be compared to another family, it would seem that the first comparison would be among the means of the species proportions (or maybe the means of the logits or probits) for each family.

Next it is conceivable that one might want to know if the variability of the species within a family varies among families.  That could be done by defining/declaring the summary statistic of interest to be the variance of the "true" proportions within a family and one would use the sample data to estimate those variances.  But these variances would serve as summary statistics rather than as a variance component essential to the definition of the model.  The underlying model would simply be the number of birds with ticks following a binomial distribution with the proportion of birds with ticks being a function of species and habitat.
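
The summary-statistic approach described in the last two paragraphs can be sketched in base R. This is only an illustration under stated assumptions: the data frame `d`, its counts, and the use of an empirical logit with a 0.5 correction are invented here and are not from the thread.

```r
# Invented counts: one row per species, with birds carrying ticks out of n caught
d <- data.frame(
  FAMILY  = c("F1", "F1", "F1", "F2", "F2", "F2"),
  SPECIES = c("a", "b", "c", "d", "e", "f"),
  ticks   = c(2, 10, 5, 1, 0, 3),
  n       = c(20, 25, 8, 30, 12, 40)
)

# Empirical logit; the 0.5 correction keeps species with zero ticks finite
d$elogit <- log((d$ticks + 0.5) / (d$n - d$ticks + 0.5))

# Per-family summaries: mean and variance of the species-level logits
fam_mean <- tapply(d$elogit, d$FAMILY, mean)
fam_var  <- tapply(d$elogit, d$FAMILY, var)

print(round(cbind(mean = fam_mean, var = fam_var), 3))
```

These are descriptive summaries of the sample; they ignore that species with few birds have noisier logits than species with many.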

I agree with the article you mentioned concerning the use of random coefficient models.  I just don't see treating species as a randomly selected subject from a family of species.  (Maybe treating insect species as a randomly selected species within a family where there are zillions of species but not for critters much higher up the food chain.)

Jim

> I think someone wise said "When you find yourself in a hole, first
> put down the shovel."  Someday I'll learn that.  (Maybe today.)  What
> follows is likely from my lack of biological (and maybe statistical)
> knowledge.
>
> The setup seems to be that individual birds (classified as to their
> species and habitat) are checked for the presence of ticks.  For each
> species and habitat combination there is a proportion of birds with
> ticks.  Each species is also classified as to genus and family.  It
> is of interest to see if there are differences among genus and family
> classifications.  I see everything as a fixed effect in this case.
>
> I see no random effects or relevant variance component, as I can't
> imagine that for any genus and family there is actually a random
> sample from all species within that family (especially if there are
> only a small number of species within a particular family to select
> from).

I have a different definition of random effects, more along
pragmatic/Bayesian than philosophical/frequentist lines (this is discussed
at more length at http://glmm.wikidot.com/faq ).  In essence, I make the
distinction between fixed and random effects more on the criteria

 * is it useful to estimate these parameters with shrinkage? (yes=random)

and

 * would I rather have the ability to extrapolate to unmeasured
units/make inferences about the variation among units (random) or to
make inferential statements about differences between particular sets of
units (fixed)?

 I do *not* make much use of the experimental-design criterion (were
these units selected randomly, or could they have been selected
randomly, from a larger set of values?).

  So I see no problem in treating family/genus/species as random
effects.  Opinions differ, though.
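
To make the shrinkage criterion concrete, here is a rough base-R sketch of partial pooling in the simplest case: a normal random-intercept model with the variance components treated as known. It is not what `glmer()` does internally for a binomial model, and the species labels, sample sizes, and the values of `sigma2` and `tau2` are all invented.

```r
# Invented species-level means on the logit scale and unbalanced sample sizes
ybar <- c(A = -2.0, B = 0.5, C = -0.5)   # raw per-species means
n    <- c(A = 3,    B = 40,  C = 200)    # birds caught per species

sigma2 <- 4     # assumed within-species variance
tau2   <- 0.5   # assumed among-species variance

# Precision-weighted grand mean across species
mu <- weighted.mean(ybar, 1 / (tau2 + sigma2 / n))

# Weight each species puts on its own mean; the rest goes to the grand mean
w <- tau2 / (tau2 + sigma2 / n)
shrunk <- mu + w * (ybar - mu)

print(round(cbind(n, raw = ybar, weight = w, shrunk), 3))
```

Species with few birds get a small weight and are pulled strongly toward the grand mean, while well-sampled species barely move, which is one way an unbalanced design is handled without any special adjustment.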

> If a family (either within a habitat type or across habitat types) is
> to be compared to another family, it would seem that the first
> comparison would be among the means of the species proportions (or
> maybe the means of the logits or probits) for each family.
>
> Next it is conceivable that one might want to know if the variability
> of the species within a family varies among families.  That could be
> done by defining/declaring the summary statistic of interest to be
> the variance of the "true" proportions within a family and one would
> use the sample data to estimate those variances.  But these variances
> would serve as summary statistics rather than as a variance component
> essential to the definition of the model.  The underlying model would
> simply be the number of birds with ticks following a binomial
> distribution with the proportion of birds with ticks being a function
> of species and habitat.

This is a sensible question, but hard to set up within lme4.  The
random effects coded in lme4 (and in most GLMMs) quantify whether the
mean (on the link scale = logit/probit/etc.) differs among units, not
whether the variation differs.  You could do this in AD Model
Builder/WinBUGS/Stan/etc.  (I think this has been discussed before on
the list.)