Interpreting model matrix columns when using contr.sum

With the following example using contr.sum for both factors,

dd <- data.frame(a = gl(3,4), b = gl(4,1,12))     # balanced 2-way
model.matrix(~ a * b, dd, contrasts.arg = list(a="contr.sum", b="contr.sum"))
   (Intercept) a1 a2 b1 b2 b3 a1:b1 a2:b1 a1:b2 a2:b2 a1:b3 a2:b3
1            1  1  0  1  0  0     1     0     0     0     0     0
2            1  1  0  0  1  0     0     0     1     0     0     0
3            1  1  0  0  0  1     0     0     0     0     1     0
4            1  1  0 -1 -1 -1    -1     0    -1     0    -1     0
5            1  0  1  1  0  0     0     1     0     0     0     0
6            1  0  1  0  1  0     0     0     0     1     0     0
7            1  0  1  0  0  1     0     0     0     0     0     1
8            1  0  1 -1 -1 -1     0    -1     0    -1     0    -1
9            1 -1 -1  1  0  0    -1    -1     0     0     0     0
10           1 -1 -1  0  1  0     0     0    -1    -1     0     0
11           1 -1 -1  0  0  1     0     0     0     0    -1    -1
12           1 -1 -1 -1 -1 -1     1     1     1     1     1     1
...

I have two questions:

(1) I assume the 1st column (under intercept) is the overall mean, the
2nd column (under a1) is the difference between the 1st level of
factor a and the overall mean, the 4th column (under b1) is the
difference between the 1st level of factor b and the overall mean. Is
this interpretation correct?

(2) I'm not so sure about those interaction columns. For example, what
is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
versus the overall mean, or something more complicated?

Thanks in advance for your help,
Gang
> I have two questions:
> (1) I assume the 1st column (under intercept) is the overall mean, the
> 2nd column (under a1) is the difference between the 1st level of
> factor a and the overall mean, the 4th column (under b1) is the
> difference between the 1st level of factor b and the overall mean.
> Is this interpretation correct?

I don't think so and furthermore I don't see why the contrasts should
have an interpretation.  The contrasts are simply a parameterization
of the space spanned by the indicator columns of the levels of the
factors.  Interpretations as overall means, etc. are mostly a holdover
from antiquated concepts of how analysis of variance tables should be
evaluated.

If you want to determine the interpretation of particular coefficients
for the special case of a balanced design (which doesn't always mean a
resulting balanced data set - I remind my students that expecting a
balanced design to produce balanced data is contrary to Murphy's Law)
the easiest way of doing so is (I think this is right but I can
somehow manage to confuse myself on this with great ease) to calculate

contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1
solve(cbind(1, contr.sum(3)))
              1          2          3
[1,]  0.3333333  0.3333333  0.3333333
[2,]  0.6666667 -0.3333333 -0.3333333
[3,] -0.3333333  0.6666667 -0.3333333
solve(cbind(1, contr.sum(4)))
         1     2     3     4
[1,]  0.25  0.25  0.25  0.25
[2,]  0.75 -0.25 -0.25 -0.25
[3,] -0.25  0.75 -0.25 -0.25
[4,] -0.25 -0.25  0.75 -0.25

That is, the first coefficient is the "overall mean" (but only for a
balanced data set), the second is a contrast of the first level with
the others, the third is a contrast of the second level with the
others and so on.
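
One way to read the rows of that inverse, sketched here under the assumption that the three level means are known: cbind(1, contr.sum(3)) maps coefficients to level means, so its inverse maps level means back to coefficients. The vector mu below is made up for illustration and is not taken from the thread.

```r
# Hypothetical level means for a three-level factor (illustrative values only)
mu <- c(10, 12, 17)

C <- cbind(1, contr.sum(3))   # level means = C %*% coefficients
beta <- solve(C) %*% mu       # so coefficients = solve(C) %*% level means

beta

# First coefficient: unweighted mean of the level means.
# Second and third: level 1 and level 2 minus that mean.
all.equal(as.vector(beta),
          c(mean(mu), mu[1] - mean(mu), mu[2] - mean(mu)))
```

The last level gets no coefficient of its own; under the sum-to-zero constraint its deviation is minus the sum of the others.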

> (2) I'm not so sure about those interaction columns. For example, what
> is a1:b1? Is it the 1st level of factor a at the 1st level of factor b
> versus the overall mean, or something more complicated?

Well, at the risk of sounding trivial, a1:b1 is the product of the a1
and b1 columns.  You need a basis for a certain subspace and this
provides one.  I don't see why there must be interpretations of the
coefficients.
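
The product statement can be checked directly on the model matrix from the question. This is only a sketch: the name X is introduced here for convenience, and the full argument name contrasts.arg is spelled out.

```r
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))     # balanced 2-way
X <- model.matrix(~ a * b, dd,
                  contrasts.arg = list(a = "contr.sum", b = "contr.sum"))

# Each interaction column is the elementwise product of two main-effect columns
all(X[, "a1:b1"] == X[, "a1"] * X[, "b1"])
all(X[, "a2:b3"] == X[, "a2"] * X[, "b3"])

# The 12 columns together form a basis for the space of the 12 cell indicators
qr(X)$rank
```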
Dear Doug and Gang Chen,

With balanced data and sum-to-zero contrasts, the intercept is indeed the
general mean of the response; the coefficient of a1 is the mean of the
response in category a1 minus the general mean; the coefficient of a1:b1 is
the mean of the response in cell a1, b1 minus the general mean and the
coefficients of a1 and b1; etc. For unbalanced data (and balanced data) the
intercept is the mean of the cell means; the coefficient of a1 is the mean
of cell means at level a1 minus the intercept; etc. Whether all this is of
interest is another question, since a simple graph of cell means tells a
more digestible story about the data.
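
These statements can be checked numerically. The sketch below makes several assumptions that are not in the thread: the response y is simulated, there are two observations per cell, and three arbitrary rows are dropped to make the data unbalanced while leaving every cell non-empty.

```r
set.seed(1)  # arbitrary seed; the response is simulated for illustration only
dd <- data.frame(a = gl(3, 8), b = gl(4, 2, 24))   # balanced: 2 observations per cell
dd$y <- rnorm(nrow(dd))
sum_contr <- list(a = "contr.sum", b = "contr.sum")

# Balanced case: coefficients match means of the raw response
cf    <- coef(lm(y ~ a * b, dd, contrasts = sum_contr))
grand <- mean(dd$y)
cell  <- tapply(dd$y, dd[c("a", "b")], mean)        # 3 x 4 table of cell means
all.equal(cf[["(Intercept)"]], grand)
all.equal(cf[["a1"]], mean(dd$y[dd$a == "1"]) - grand)
all.equal(cf[["a1:b1"]], cell["1", "1"] - grand - cf[["a1"]] - cf[["b1"]])

# Unbalanced case: coefficients match means of the cell means instead
dd2   <- dd[-c(1, 10, 11), ]                        # every cell still has data
cf2   <- coef(lm(y ~ a * b, dd2, contrasts = sum_contr))
cell2 <- tapply(dd2$y, dd2[c("a", "b")], mean)
all.equal(cf2[["(Intercept)"]], mean(cell2))
all.equal(cf2[["a1"]], mean(cell2["1", ]) - mean(cell2))
```

Each comparison should come back TRUE. In the unbalanced fit the intercept generally differs from mean(dd2$y), which is why the "overall mean" reading holds only for balanced data.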

Regards,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox
Many thanks to both Drs. Bates and Fox for the help!

I also figured out yesterday what Dr. Fox just said regarding the
interpretations of those coefficients for a balanced design. Thanks
Dr. Bates for the suggestion of using solve(cbind(1, contr.sum(4))) to
sort out the factor level effects. Model validation is very important,
but interpreting those coefficients, at least in the case of balanced
designs, also provides some insights about various effects for the
people working in the field.

Gang