Hi,
On Wed, Mar 6, 2013 at 5:46 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:
All,
I have just returned to R after a decade of absence, and it is good to see
that R has become such a great success! I'm trying to bring Design of
Experiments into some aspects of software performance evaluation, and to
teach myself that, I picked up "Experiments: Planning, Analysis and
Optimization" by Wu and Hamada. I try to reproduce an analysis in the book
using lm, but have to conclude I don't understand what lm does in this
context, even though I end up at the desired result. I'm currently using R
2.15.2 on a recent Fedora system, but I get the same result on Debian Wheezy
and Debian Squeeze. I think the discussion below can be followed without
having the book at hand though.
I'm working with tables 5.2 and 5.5 in the above-mentioned book. Table 5.2
contains data from the "Leaf spring experiment". The dataset is also in this
zip file:
ftp://ftp.wiley.com/public/sci_tech_med/experiments-planning/data%20sets.zip
I've learned from the book that the effects can be found using a linear
model and double the coefficients. So, I do
leaf <- read.table("/ifi/bifrost/a03/kjekje/fag/experimental-planning/book-datasets/LeafSpring table 5.2.dat",
                   col.names = c("B", "C", "D", "E", "Q", paste("r", 1:3, sep = ""), "yavg", "ssq", "lnssq"))
leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)
That is complete nonsense: the full B * C * D * E * Q expansion has 32 terms
(including the intercept), so you are trying to estimate 32 coefficients from
only 16 data points. That is never going to work.
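A quick way to check, as a sketch using the leaf and leaf.lm objects defined
above:

length(coef(leaf.lm))      # 32 terms in the full B * C * D * E * Q expansion
nrow(leaf)                 # but only 16 runs in the data
sum(is.na(coef(leaf.lm)))  # the aliased terms simply come back as NA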
Call:
lm(formula = yavg ~ B * C * D * E * Q, data = leaf)
Coefficients:
   (Intercept)             B+             C+             D+             E+
       7.54000        0.07003        0.32333       -0.09668        0.07668
            Q+          B+:C+          B+:D+          C+:D+          B+:E+
      -0.33670        0.01335        0.11995        0.02335             NA
         C+:E+          D+:E+          B+:Q+          C+:Q+          D+:Q+
            NA             NA        0.22915       -0.25745        0.28255
         E+:Q+       B+:C+:D+       B+:C+:E+       B+:D+:E+       C+:D+:E+
       0.05415             NA             NA             NA             NA
      B+:C+:Q+       B+:D+:Q+       C+:D+:Q+       B+:E+:Q+       C+:E+:Q+
       0.04160       -0.16160       -0.18840             NA             NA
      D+:E+:Q+    B+:C+:D+:E+    B+:C+:D+:Q+    B+:C+:E+:Q+    B+:D+:E+:Q+
            NA             NA             NA             NA             NA
   C+:D+:E+:Q+ B+:C+:D+:E+:Q+
            NA             NA
(seems there is little I can do about the line breaks here, sorry)
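One way to fit something the 16-run design can actually support is to drop
the higher-order terms. As a sketch (the choice of which interactions to keep
is an assumption of mine, not something taken from the book):

leaf.small <- lm(yavg ~ (B + C + D + E) * Q, data = leaf)
coef(leaf.small)   # main effects plus the two-factor interactions with Q, no NAs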
However, the book (table 5.5) has 0.221 for the main effect of B and 0.176 for
the main effect of C, and the coefficients above are neither these values nor
half of them. Now, I can reproduce what's in the book with
Call:
lm(formula = yavg ~ B, data = leaf)
Coefficients:
(Intercept)          B+
     7.5254      0.2213
Call:
lm(formula = yavg ~ C, data = leaf)
Coefficients:
(Intercept)          C+
     7.5479      0.1763
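For what it's worth, with the default treatment contrasts the B+ coefficient
in a one-factor model is simply the difference between the two level means of
yavg, which as far as I can tell is exactly the book's definition of the main
effect. A direct check, as a sketch (assuming the levels really are labelled
"-" and "+", as the coefficient names suggest):

with(leaf, mean(yavg[B == "+"]) - mean(yavg[B == "-"]))  # should give the 0.2213 above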
Assuming lm does in fact double the coefficient in this case,
I have no idea what this means.
but here the
intercept varies, which doesn't seem correct,
You mean that the intercept for
lm(yavg ~ B, data = leaf)
differs from the intercept for
lm(yavg ~ C, data = leaf)?
If so, that is expected. The intercept is the expected value of yavg when all
predictors are zero; for a factor like B under the default treatment
contrasts, "zero" means its reference level (here "-"). The expected value of
yavg at B's reference level does not have to be the same as at C's reference
level.
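A sketch of that check, assuming the default treatment contrasts and that the
levels are labelled "-" and "+":

coef(lm(yavg ~ B, data = leaf))[["(Intercept)"]]  # intercept of the B-only model
mean(leaf$yavg[leaf$B == "-"])  # mean response at B's reference level
mean(leaf$yavg[leaf$C == "-"])  # mean at C's reference level; no reason these agree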
nor can I as trivially find
the interactions the same way.
Now, I try the effects() function, and get familiar numbers:
(Intercept)          B+          C+          D+          E+          Q+
  -30.54415    -0.44250     0.35250    -0.05750    -0.20750    -0.51920
      B+:C+       B+:D+       C+:D+       B+:Q+       C+:Q+       D+:Q+
   -0.03415    -0.03915     0.07085    -0.16915     0.33085    -0.10755
      E+:Q+    B+:C+:Q+    B+:D+:Q+    C+:D+:Q+
    0.05415    -0.02080     0.08080    -0.09420
and indeed, I have verified that effects(leaf.lm)/2 gives me the expected
result.
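Another way to arrive at the table 5.5 numbers is to recode the factors as
-1/+1 and refit; the recoding and the particular choice of model terms below
are my own, not something taken from the book. With this coding, twice an lm
coefficient should be an effect in the book's sense:

leaf2 <- leaf
for (f in c("B", "C", "D", "E", "Q")) leaf2[[f]] <- ifelse(leaf[[f]] == "+", 1, -1)
leaf2.lm <- lm(yavg ~ (B + C + D + E) * Q, data = leaf2)
2 * coef(leaf2.lm)[-1]  # should match table 5.5 (and effects(leaf.lm)/2, up to sign)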
So, I have found the correct answer, but I don't understand why. I have read
the documentation for effects() and looked through the relevant chapter in
"Statistical Models in S", but all I got from that was a hint in the phrase
"the effects are the uncorrelated single-degree-of-freedom values", which
tells me they are somewhat different from the coefficients. However, I can't
make out from the book (Wu & Hamada) why the coefficients should be any
different from the effects; on the contrary, it is quite clear from equation
(5.8) in the book that the coefficients they use are effects(leaf.lm)/4.
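For whatever it is worth, my tentative reading of that phrase is that
effects() returns the response projected onto the orthogonal basis from the
QR decomposition of the model matrix (nothing to do with the experimental
factor Q). That reading can be checked along these lines (a sketch, and I may
well be misreading the documentation):

ef  <- effects(leaf.lm)
qty <- qr.qty(leaf.lm$qr, leaf$yavg)      # t(Q) %*% y from the fit's own QR
all.equal(as.vector(ef), as.vector(qty))  # TRUE if the reading above is right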
So, there are at least two points of confusion here: one is how coef()
differs from effects() in the case of fractional factorial experiments, and
the other is the factor of 1/4 between the coefficients used by Wu & Hamada
and the values returned by effects(), where from the theory I've read I would
expect a factor of 2.
I don't know the answers to these questions (I don't really understand
the effects function). So I'm afraid all I've done is add more
questions, i.e., why in the world are we estimating 32 coefficients
from 16 points?
Best,
Ista
Best regards,
Kjetil
--
Kjetil Kjernsmo
PhD Research Fellow, University of Oslo, Norway
Semantic Web / SPARQL Query Federation
kjekje at ifi.uio.no http://www.kjetil.kjernsmo.net/