Message-ID: <40e66e0b0909260543k6d7621a2o21eec5402ae8f10a@mail.gmail.com>
Date: 2009-09-26T12:43:04Z
From: Douglas Bates
Subject: Data sheet notation and model structure for GLMM with 3 non-factorial factors
In-Reply-To: <30406dd0909260111u41efa321nfa4c1766f255eebf@mail.gmail.com>
On Sat, Sep 26, 2009 at 3:11 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
> Hi Douglas,
> Many thanks for the input. I've run two analyses on the same dataset
> using 1) indicator columns and 2) a single 'factor / treatment'
> column for the non-factorial design described in my previous e-mail,
> and the results were identical (great!).
> However, I did the same for a dataset with a factorial design (N, G,
> N*G, i.e. there were plots with N, plots with G, and plots with both N
> and G), and the results for the main effects are identical, but the
> estimates for the interaction effects (N*G) are different between the
> two analyses (see below). Could you please help me make sense of that
> (i.e. which one is correct)?
Generally, when N and G can occur in combination, you would treat the
design as a two-factor, two-level factorial. That is, one factor for
presence or absence of G and another factor for presence or absence of
N. You could treat it as a single factor with four levels (neither, G
only, N only, and both N and G) but, as you have seen, you then need to
translate between the representations. Both fits describe the same
model; only the parameterization differs.
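
As a rough illustration of that translation (a sketch only, not your
fitted models): with 0/1 indicators N and G and a hypothetical
four-level factor built from them, the two model matrices span the same
space, and a small matrix maps the factorial coefficients to the
single-factor ones. The names d, Treat and M are my own for this
example.

```r
# One row per cell of the 2 x 2 design (0/1 indicators, as in the expanded notation)
d <- expand.grid(N = c(0, 1), G = c(0, 1))

# Hypothetical single-column coding of the same four cells
d$Treat <- factor(
  ifelse(d$N == 1 & d$G == 1, "NG",
         ifelse(d$N == 1, "N", ifelse(d$G == 1, "G", "C"))),
  levels = c("C", "G", "N", "NG")
)

X_fact   <- model.matrix(~ N * G, data = d)  # (Intercept), N, G, N:G
X_single <- model.matrix(~ Treat, data = d)  # (Intercept), TreatG, TreatN, TreatNG

# Solve X_single %*% M = X_fact, so that beta_single = M %*% beta_fact
M <- round(solve(X_single, X_fact))
dimnames(M) <- list(colnames(X_single), colnames(X_fact))
M
```

The TreatNG row of M has ones in the N, G and N:G columns, i.e. the
TreatNG coefficient is the sum of the two main effects and the
interaction from the factorial fit.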
In the two-factor, two-level factorial design, let a be the estimate
of the main effect for G, b be the estimate of the main effect for N,
and c be the interaction estimate. In your example a = 0.14929, b =
0.03766 and c = -0.31633. Then the estimated effect for the NG cell,
relative to the baseline (neither N nor G) on the scale of the linear
predictor, is a + b + c, which matches the TreatNG coefficient in the
single-column fit:
> 0.03766 + 0.14929 + (-0.31633)
[1] -0.12938
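
The same bookkeeping should apply to the Year interactions. A quick
check using the estimates copied from your expanded-notation output
(the variable names are just for this sketch, and small discrepancies
in the last digits are to be expected because the two fits converged
separately):

```r
# Estimates copied from the expanded-notation output
main      <- c(N = 0.03766, G = 0.14929, NG = -0.31633)
yearthree <- c(N = 0.15710, G = -0.25107, NG = 0.36353)
yeartwo   <- c(N = 0.14736, G = 0.07550, NG = -0.01158)

sum(main)       # compare with TreatNG           = -0.12938
sum(yearthree)  # compare with TreatNG:Yearthree =  0.26959
sum(yeartwo)    # compare with TreatNG:Yeartwo   =  0.21118
```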
> Thanks,
> Raldo
>
> With expanded treatment notation-
> Fixed effects:
>               Estimate Std. Error z value Pr(>|z|)
> (Intercept)    2.92060    0.23834  12.254  < 2e-16 ***
> N              0.03766    0.03486   1.080   0.2801
> G              0.14929    0.03395   4.397 1.10e-05 ***
> Yearthree     -2.85449    0.10664 -26.768  < 2e-16 ***
> Yeartwo       -1.88175    0.06844 -27.494  < 2e-16 ***
> N:G           -0.31633    0.04953  -6.386 1.70e-10 ***
> N:Yearthree    0.15710    0.14428   1.089   0.2762
> N:Yeartwo      0.14736    0.09305   1.584   0.1133
> G:Yearthree   -0.25107    0.15430  -1.627   0.1037
> G:Yeartwo      0.07550    0.09200   0.821   0.4118
> N:G:Yearthree  0.36353    0.20810   1.747   0.0807 .
> N:G:Yeartwo   -0.01158    0.12996  -0.089   0.9290
>
> With single column treatment notation-
> Fixed effects:
>                    Estimate Std. Error z value Pr(>|z|)
> (Intercept)         2.92057    0.23836  12.253  < 2e-16 ***
> TreatG              0.14928    0.03395   4.397 1.10e-05 ***
> TreatN              0.03767    0.03486   1.080 0.279928
> TreatNG            -0.12938    0.03639  -3.556 0.000377 ***
> Yearthree          -2.85448    0.10664 -26.768  < 2e-16 ***
> Yeartwo            -1.88175    0.06844 -27.494  < 2e-16 ***
> TreatG:Yearthree   -0.25109    0.15430  -1.627 0.103693
> TreatN:Yearthree    0.15711    0.14428   1.089 0.276199
> TreatNG:Yearthree   0.26959    0.14636   1.842 0.065483 .
> TreatG:Yeartwo      0.07549    0.09200   0.820 0.411941
> TreatN:Yeartwo      0.14735    0.09305   1.583 0.113308
> TreatNG:Yeartwo     0.21118    0.09558   2.210 0.027139 *
>
>
> On Thu, Sep 24, 2009 at 2:10 PM, Douglas Bates <bates at stat.wisc.edu> wrote:
>> On Thu, Sep 24, 2009 at 1:22 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
>>> Hi R users,
>>>
>>> I have 3 factors in a non-factorial design (G, K and N), as well as
>>> two time periods (Year) and a random factor (Site), with Plant numbers
>>> as the response variable.
>>>
>>> My 1st question relates to the notation of the treatments in the
>>> data frame. Is it appropriate to use an expanded treatment notation,
>>> such as this, when using glmer{lme4}:
>>>
>>> Site  Year  Plant  G  K  N
>>> A     1     5      0  0  0
>>> A     1     4      1  0  0
>>> A     1     7      0  1  0
>>> A     1     10     0  0  1
>>> A     2     3      0  0  0
>>> A     2     4      1  0  0
>>> A     2     8      0  1  0
>>> A     2     12     0  0  1
>>> B     1     7      0  0  0
>>> B     1     3      1  0  0
>>> B     1     7      0  1  0
>>> B     1     12     0  0  1
>>> B     2     4      0  0  0
>>> B     2     5      1  0  0
>>> B     2     6      0  1  0
>>> B     2     11     0  0  1
>>>
>>> With the model
>>>
>>> m1 <- glmer(Plant ~ G + K + N + Year + (1 | Site), ...)
>>>
>>> Or is it better to use a single column for the treatments, like this:
>>>
>>> Site  Year  Plant  Treatment
>>> A     1     5      C
>>> A     1     4      G
>>> A     1     7      K
>>> A     1     10     N
>>> A     2     3      C
>>> A     2     4      G
>>> A     2     8      K
>>> A     2     12     N
>>> B     1     7      C
>>> B     1     3      G
>>> B     1     7      K
>>> B     1     12     N
>>> B     2     4      C
>>> B     2     5      G
>>> B     2     6      K
>>> B     2     11     N
>>>
>>> With the following model:
>>> m1 <- glmer(Plant ~ Treatment + Year + (1 | Site), ...)
>>
>> The latter is preferred. R will generate the indicator columns for
>> the levels of the Treatment factor (the 0/1 columns shown in the first
>> form) and, when appropriate, reduce them to a set of 3 "contrasts" in
>> the model. (The reason for quoting the word "contrasts" is that there
>> is a formal mathematical definition of a contrast but the linear
>> combinations generated by R do not always satisfy this definition.
>> The method and results are correct, it is just the name that is
>> inaccurate.)
>>
>> There are two reasons that the latter is preferred. First, it is
>> easier to maintain the data in a consistent form: factors maintain
>> consistency and are easy to check in the output from str() or
>> summary(), whereas indicator columns have inter-column dependencies
>> that must be checked separately. Second, there is the "when
>> appropriate" clause above. Determining a useful parameterization of a
>> linear model incorporating factors is subtle, and a lot of code in
>> the R function model.matrix is devoted to a symbolic analysis
>> designed to get this right. Also, you can, if you wish, change the
>> parameterization (see ?contrasts).
>>
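
To make the point about indicator columns and "contrasts" concrete,
here is a minimal sketch using a hypothetical four-level Treatment
factor with the levels from the single-column layout (C, G, K, N). It
assumes the default contrasts option for unordered factors
(contr.treatment); the object names are for illustration only.

```r
# Hypothetical factor with one observation per treatment level
Treatment <- factor(c("C", "G", "K", "N"))

# Without an intercept: the full set of 0/1 indicator columns, one per level
X_ind <- model.matrix(~ 0 + Treatment)
colnames(X_ind)

# With an intercept: the indicator for the first level (C) is dropped,
# leaving 3 columns that compare G, K and N with the baseline C
X_trt <- model.matrix(~ Treatment)
colnames(X_trt)

# A different parameterization of the same factor: sum-to-zero contrasts
X_sum <- model.matrix(~ Treatment,
                      contrasts.arg = list(Treatment = "contr.sum"))
X_sum
```

Any of these parameterizations gives the same fitted values; only the
meaning of the individual coefficients changes.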
>
>
>
> --
> Raldo
>