Dear Doug and Gang Chen,
With balanced data and sum-to-zero contrasts, the intercept is indeed the
general mean of the response; the coefficient of a1 is the mean of the
response in category a1 minus the general mean; the coefficient of a1:b1 is
the mean of the response in cell a1, b1 minus the general mean and the
coefficients of a1 and b1; etc. For unbalanced data (and balanced data) the
intercept is the mean of the cell means; the coefficient of a1 is the mean
of cell means at level a1 minus the intercept; etc. Whether all this is of
interest is another question, since a simple graph of cell means tells a
more digestible story about the data.
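A quick way to check these statements numerically is sketched below. The unbalanced cell counts and the response y are made up for illustration; only the model formula and the contrasts come from the thread.

```r
# Hypothetical unbalanced two-way layout: cell counts and response are invented.
set.seed(1)
cells  <- expand.grid(a = factor(1:3), b = factor(1:4))
counts <- c(1, 2, 3, 2, 1, 3, 3, 1, 2, 2, 3, 1)   # unequal counts, every cell filled
dd <- cells[rep(seq_len(nrow(cells)), times = counts), ]
dd$y <- rnorm(nrow(dd))

fit <- lm(y ~ a * b, data = dd,
          contrasts = list(a = "contr.sum", b = "contr.sum"))

# 3 x 4 table of cell means (rows: levels of a, columns: levels of b)
cell_means <- tapply(dd$y, list(dd$a, dd$b), mean)

grand  <- mean(cell_means)                # unweighted mean of the cell means
a1_eff <- mean(cell_means[1, ]) - grand   # mean of cell means at a1 minus intercept
b1_eff <- mean(cell_means[, 1]) - grand
a1b1   <- cell_means[1, 1] - grand - a1_eff - b1_eff

# Each comparison should report TRUE
all.equal(unname(coef(fit)["(Intercept)"]), grand)
all.equal(unname(coef(fit)["a1"]), a1_eff)
all.equal(unname(coef(fit)["a1:b1"]), a1b1)
```

Because the model with the a:b interaction is saturated in the cells, the fitted values are the cell means, which is why the unweighted averages appear even when the counts differ.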
Regards,
John
------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Douglas Bates
Sent: January-25-09 10:49 AM
To: Gang Chen
Cc: R-help
Subject: Re: [R] Interpreting model matrix columns when using contr.sum
On Fri, Jan 23, 2009 at 4:58 PM, Gang Chen <gangchen6 at gmail.com> wrote:
With the following example using contr.sum for both factors,
dd <- data.frame(a = gl(3,4), b = gl(4,1,12)) # balanced 2-way
model.matrix(~ a * b, dd, contrasts = list(a="contr.sum", b="contr.sum"))
   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
1            1  1  0  1  0  0     1     0     0     0     0     0
2            1  1  0  0  1  0     0     0     1     0     0     0
3            1  1  0  0  0  1     0     0     0     0     1     0
4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
5            1  0  1  1  0  0     0     1     0     0     0     0
6            1  0  1  0  1  0     0     0     0     1     0     0
7            1  0  1  0  0  1     0     0     0     0     0     1
8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
...
(1) I assume the 1st column (under intercept) is the overall mean, the
2nd column (under a1) is the difference between the 1st level of
factor a and the overall mean, the 4th column (under b1) is the
difference between the 1st level of factor b and the overall mean.
Is this interpretation correct?
I don't think so and furthermore I don't see why the contrasts should
have an interpretation. The contrasts are simply a parameterization
of the space spanned by the indicator columns of the levels of the
factors. Interpretations as overall means, etc. are mostly a holdover
from antiquated concepts of how analysis of variance tables should be
evaluated.
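One way to see the "just a parameterization" point is the sketch below: two contrast codings give different coefficients but the same fitted values, because both span the same column space. The response y is simulated and purely hypothetical, and an additive model is used only so that the fit is not saturated.

```r
# Hypothetical response on the balanced layout from the question.
set.seed(42)
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))
dd$y <- rnorm(12)

fit_sum <- lm(y ~ a + b, data = dd,
              contrasts = list(a = "contr.sum", b = "contr.sum"))
fit_trt <- lm(y ~ a + b, data = dd,
              contrasts = list(a = "contr.treatment", b = "contr.treatment"))

# The coefficients differ between the two codings ...
coef(fit_sum)
coef(fit_trt)

# ... but the fitted values agree (should report TRUE)
all.equal(fitted(fit_sum), fitted(fit_trt))
```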
If you want to determine the interpretation of particular coefficients
for the special case of a balanced design (which doesn't always mean a
resulting balanced data set - I remind my students that expecting a
balanced design to produce balanced data is contrary to Murphy's Law)
the easiest way of doing so is (I think this is right but I can
somehow manage to confuse myself on this with great ease) to calculate
contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
solve(cbind(1, contr.sum(3)))
              1          2          3
[1,]  0.3333333  0.3333333  0.3333333
[2,]  0.6666667 -0.3333333 -0.3333333
[3,] -0.3333333  0.6666667 -0.3333333
solve(cbind(1, contr.sum(4)))
         1     2     3     4
[1,]  0.25  0.25  0.25  0.25
[2,]  0.75 -0.25 -0.25 -0.25
[3,] -0.25  0.75 -0.25 -0.25
[4,] -0.25 -0.25  0.75 -0.25
That is, the first coefficient is the "overall mean" (but only for a
balanced data set), the second is a contrast of the first level with
the others, the third is a contrast of the second level with the
others and so on.
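To make the rows of the inverse concrete, here is a small sketch applying it to a vector of level means. The values in mu are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical means for the three levels of a factor.
mu <- c(10, 20, 60)

C    <- cbind(1, contr.sum(3))   # maps coefficients to level means
beta <- drop(solve(C) %*% mu)    # maps level means back to coefficients

beta
# beta[1] is mean(mu)         = 30
# beta[2] is mu[1] - mean(mu) = -20
# beta[3] is mu[2] - mean(mu) = -10

# Round trip: the coefficients reproduce the level means (should report TRUE)
all.equal(drop(C %*% beta), mu, check.attributes = FALSE)
```

Read this way, the second row of the inverse, (2/3, -1/3, -1/3), is the first level mean minus the average of all the level means.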
(2) I'm not so sure about those interaction columns. For example, what
is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
versus the overall mean, or something more complicated?
Well, at the risk of sounding trivial, a1:b1 is the product of the a1
and b1 columns. You need a basis for a certain subspace and this
provides one. I don't see why there must be interpretations of the
coefficients.
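Both points can be checked on the model matrix from the question; the sketch below uses only base R. The name w is introduced here for illustration, and reading its entries as weights on cell means assumes the balanced one-row-per-cell layout in dd.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))   # one row per cell
X  <- model.matrix(~ a * b, dd,
                   contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# The interaction column is the elementwise product (should report TRUE)
all.equal(X[, "a1:b1"], X[, "a1"] * X[, "b1"])

# The 12 columns form a basis for the 12-dimensional space of cell means
qr(X)$rank

# Row of the inverse for a1:b1, laid out as a 3 x 4 table of weights on the
# cell means (rows: levels of a, columns: levels of b)
w <- matrix(solve(X)["a1:b1", ], nrow = 3, byrow = TRUE)
w
```

If an interpretation is wanted anyway, w shows which linear combination of cell means the a1:b1 coefficient represents in this layout.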