Meaning of /, :, and %in% in lmer - R-SIG-mixed-models

Fri, Apr 18, 2008 5:11 AM

On 4/16/08, Claus Wilke <cwilke at mail.utexas.edu> wrote:

The first two, (1|A/B) and (1|A:B), are forms that lmer recognizes.
I'm not sure what the effect of the third form, (1|B %in% A), would be
and would not advise using it.

Most uses of the %in% operator in R at present are as a logical operator.

The short answer is that (1|A/B) is expanded to (1|A) + (1|A:B) so you
can choose whatever form makes sense to you.
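
As a rough illustration of that expansion, here is a sketch that uses only base R's formula machinery rather than lmer's own parser. A and B are placeholder factor names, and no data are needed because the expansion is purely symbolic.

```r
# Expand the nesting operator A/B symbolically and list the resulting terms.
# terms() and its "term.labels" attribute are base R (stats package).
expanded <- attr(terms(~ A / B), "term.labels")
print(expanded)
```

The result contains one term for A and one for A:B, which mirrors the way (1|A/B) is described above as (1|A) + (1|A:B).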

There are different circumstances where a notation like (1|A/B) would
be used.  Some are reasonable choices and some are artifacts of
artificial ways of assigning labels to factor levels.  Rather than my
trying to guess what kind of application you have in mind, could you
describe a situation where you would want to fit an lmer model with
terms like that?

I am cc:ing the R-SIG-Mixed-Models list on this reply and I suggest we
move the discussion to that list.

Claus Wilke

Fri, Apr 18, 2008 9:47 AM

Thanks, that was what I needed to hear.

It's a virology experiment. We have two ancestral strains. From each of those 
we have derived several new strains, and then have made multiple fitness 
measurements on the new strains. We want to know whether the ancestral strain 
has an effect on the fitness of the derived strains. The model I'm using for 
that is
	fitness ~ ancestor + (1|ancestor:strain),
because strains are nested within ancestors. If I were using
	fitness ~ ancestor + (1|ancestor/strain),
then ancestor would get both a fixed and a random effect, which doesn't make 
sense.

I have a second question, related to the hypothesis testing of whether the 
fixed ancestor effect is significant. I've read all the threads about why it 
is problematic to do an F test to calculate a p value, and that it is better 
to do Markov chain Monte Carlo. My question is: Is there a proper reference I 
can cite to substantiate the claim that the standard (i.e., SAS) way of 
calculating significance in this case is problematic, or do I have to refer 
to the mailing list archive?

Thanks a lot,
  Claus

Claus Wilke
Section of Integrative Biology 
 and Center for Computational Biology and Bioinformatics 
University of Texas at Austin
1 University Station C0930
Austin, TX 78712
cwilke at mail.utexas.edu
512 471 6028

Douglas Bates

Sat, Apr 19, 2008 11:58 AM

On 4/18/08, Claus Wilke <cwilke at mail.utexas.edu> wrote:

The labeling question is related to the levels of the strain factor.
To me the sensible way to label strains is to give each unique strain
a unique label.  In fact, I would go so far as to say that is the only
sensible way.  So suppose the ancestral strains are called "A" and "B"
and there were 8 strains derived from "A" and 12 strains derived from
"B".  Then I would give them labels like "A01" up to "A08" and "B01" up
to "B12".  Many people feel the strains from ancestor A should be
labeled 1 up to 8 and those from ancestor B labeled 1 up to 12 and
then incorporate the information that strain is nested within ancestor
somewhere in the model description.  To me this makes no sense.  If
strain 1 from ancestor A is not related in any way to strain 1 from
ancestor B, why call them both "1"?

If the strains are labeled so that each unique strain has a unique
label then the model can be written as
  fitness ~ ancestor + (1|strain)
or as
  fitness ~ ancestor + (1|ancestor:strain)
whichever one makes sense to you.  If the levels of strain reflect an
implicit nesting (that is, you need to know that strain 1 from
ancestor A is not the same as strain 1 from ancestor B, even though
they are given the same level of strain) then you must write the model
in the second form but only because the labels of strain are ambiguous
and the expression ancestor:strain is required to disambiguate the
levels.
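
The labeling point can be sketched in base R. The layout below (8 strains from "A", 12 from "B") follows the example in the text; the variable names strain_within and strain_unique are made up for illustration and are not from the original data.

```r
# Hypothetical layout: 8 strains derived from ancestor "A", 12 from "B".
ancestor <- rep(c("A", "B"), times = c(8, 12))

# Ambiguous labeling: strain numbers restart at 1 within each ancestor.
strain_within <- c(1:8, 1:12)

# Unique labeling: "A01" ... "A08", "B01" ... "B12".
strain_unique <- sprintf("%s%02d", ancestor, strain_within)

# Count the grouping levels each choice would give a random-effects term.
n_ambiguous   <- nlevels(factor(strain_within))
n_unique      <- nlevels(factor(strain_unique))
n_interaction <- nlevels(interaction(ancestor, strain_within, drop = TRUE))

print(c(ambiguous = n_ambiguous, unique = n_unique, interaction = n_interaction))
```

With the restarting labels alone there are only 12 distinct levels, because strain 1 from A and strain 1 from B are collapsed into one group. Crossing those labels with ancestor recovers the same 20 groups that the unique labels give directly, which is why the ancestor:strain form is needed only when the labels are ambiguous.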

Harald Baayen's recent book on "Analyzing Linguistic Data" has a good
discussion of some of the issues in determining significance of
fixed-effects terms in a mixed-effects model.  I like some of the
explanations in his chapter 7.

To tell the truth I expect that the standard approach is reasonably
accurate for cases where the only random effects term in the model is
of the form  (1|strain); it's in the more complex models that the
simple approximations get off track.  The sort of data that Harald and
many others in psychometric areas consider is cross-classified
according to subject and item and the standard approaches get bogged
down there.