On 11/8/07, Marco Chiarandini <marco at imada.sdu.dk> wrote:
Dear Prof. Bates,
I am trying to use the function lmer from lme4 to analyse the following nested factorial design.
I have three treatment factors (neighborhood, initial, k); I have three group factors crossing (size, dens, inst).
Did you mean to write (size, dens, type) there? Also, by "factor" do you mean that you regard all of these variables as categorical? If so, you should check the form of the size variable in the data frame. It is being stored as a numeric variable, not as a factor. If you want to interpret this variable as a categorical factor you should convert it to a factor or, as seems likely in this case, an ordered factor. (See ?factor and ?ordered)
I have one random factor (Subject or, in my case, inst) nested within the groups generated by the crossing of the group factors.
Nesting is automatically handled appropriately in lmer as long as the levels of the inst factor are distinct. That is, if each distinct level of the inst factor has a distinct label, which also appears likely in this case because I see the first label contains G and 200 and I assume that these parts of the name correspond to the level of the type and size variables. If so, then you simply need to use inst as the grouping factor for the random effects. The only time that nesting must be explicitly stated is when the levels of the variable(s) at the inner level(s) are incomplete. I call this "implicit nesting". Suppose I choose 20 different plants from an experimental plot and extract several seeds from each plant then perform multiple analyses on each seed, I could label the plants "A", "B", "C", ...., "T" and the seeds "a", "b","c", ... for each plant. This is implicit nesting in that I only have, say, 12 seed labels but there may be one or two hundred seeds. To specify a particular seed I must not only specify its seed label but also the plant from which it came. If I just specify "seed" in a model formula I will get an inappropriate model fit. I need to somehow specify seed within plant. In my view this is not a characteristic of the experiment - it's just a dumb way of labeling the seeds. If you use labels like "Aa", "Ab", ..., as I suspect you have done for your "inst" factor, then the problem goes away.
I would like to write a formula for lmer of the kind: treat1 + treat2 + treat3 + group1 + group2 + group3 + Subject(group1*group2*group3) eventually plus some interaction terms. I am trying the following: mod <- lmer(err~k+initial+neighborhood+size+dens+type+(1|inst),data=Case3)
I think that specifcation corresponds to the model that you describe above, although I am not quite sure what the distinction between a treatment factor and a group factor is.
> summary(mod)
Linear mixed-effects model fit by REML
Formula: err ~ k + initial + neighborhood + size +
dens + type + (1 | inst)
Data: Case3
AIC BIC logLik MLdeviance REMLdeviance
16472 16542 -8224 16433 16448
Random effects:
Groups Name Variance Std.Dev.
inst (Intercept) 2.61 1.62
Residual 49.23 7.02
number of obs: 2430, groups: inst, 90
It appears that the random effect for the inst factor may be unnecessary. You may want to check what the log-likelihood for a model with only fixed-effects is.
> anova(mod)
Analysis of Variance Table
Df Sum Sq Mean Sq
k 2 5473 2736
initial 2 72168 36084
neighborhood 2 18622 9311
size 1 52 52
dens 2 2 1
type 1 31378 31378
But results are very different from what I get in SAS
I can't really comment on that without knowing how you specified the model in SAS and what analysis of variance results from SAS you are comparing. This analysis of variance table, like most such tables in R, is the decomposition of the variation in the response according to the terms in the order they were given in the formula. As Bill Venables describes in his famous (and, regretably, unpublished) paper "Exegeses on linear models" (just search for the title in a search engine) this is the only decomposition that makes sense but that does not deter many people, including the authors of SAS, from creating other decompositions that may on the surface appear to make sense but do not withstand careful scrutiny. It appears that you may have a completely balanced experiment here (I'm guessing that it is a computer experiment) in which case the decomposition is invariant to reordering of the terms in the model. In general, if type, size and dens are blocking factors or environmental factors (that is, they represent a known source of variability and you wish to control for these factors in examining your experimental factors) then they should be entered first in the formula.
and I cannot figure out why the anova method does not return the test for significance.
Ah, that's a long story. One can calculate F-ratios for fixed-effects terms in a linear mixed model but they don't have an F distribution except in certain balanced cases. Determining a p-value for a fixed-effects term in a mixed model fit to unbalanced data is not trivial. At one time I did list p-values in such a table but they were approximations and rather coarse approximations that erred in the wrong direction. Some users, quite reasonably, objected that these could be dangerously misleading so my current solution is not to return a p-value at all.
If I try this formula, the computation never stops: mod <- lmer(err~initial+neighborhood+k+(1|inst) + (inst|type)+(inst|size) + (inst|dens),data=Case3)
You really, really don't want to try that. The expression on the left hand side of a random-effects term is treated as a linear model formula from which a model matrix is evaluated. The model matrix for inst has 90 columns so each level of type is being modeled as having 90, possibly correlated, random effects associated with it. The same for size and dens. 90 correlated random effects requires estimation of 90 variance parameters and 4005 covariance parameters. The general rule is that a factor on the left hand side of the '|' should have a very small number of levels whereas a factor on the right hand side should have a large number of levels.
Which should be the right formula? In case, this is the structure of my data:
> str(Case3)
'data.frame': 2430 obs. of 8 variables: $ err : num 12.60 11.03 9.99 2.48 2.48 ... $ initial : Factor w/ 3 levels "greedy_cov","lightest_add",..: 2 2 2 1 1 1 3 3 3 2 ... $ neighborhood: Factor w/ 3 levels "k-add","k-cov",..: 1 1 1 1 1 1 1 1 1 2 ... $ k : Factor w/ 3 levels "1","3","5": 1 2 3 1 2 3 1 2 3 1 ... $ type : Factor w/ 2 levels "G","U": 1 1 1 1 1 1 1 1 1 1 ... $ size : num 200 200 200 200 200 200 200 200 200 200 ... $ dens : Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 1 ... $ inst : Factor w/ 90 levels "G-200-0.1-1",..: 1 1 1 1 1 1 1 1 1 1 ... Thank you in advance for any suggestion you may provide. Best regards, Marco -- Marco Chiarandini http://www.imada.sdu.dk/~marco Department of Mathematics Email: marco at imada.sdu.dk and Computer Science, Phone: +45 6550 4031 University of Southern Denmark Fax: +45 6593 2691