
Understanding lm-based analysis of fractional factorial experiments

13 messages · Kjetil Kjernsmo, Ben Bolker, Ista Zahn +3 more

#
All,

I have just returned to R after a decade of absence, and it is good to 
see that R has become such a great success! I'm trying to bring Design 
of Experiments into some aspects of software performance evaluation, and 
to teach myself that, I picked up "Experiments: Planning, Analysis and 
Optimization" by Wu and Hamada. I try to reproduce an analysis in the 
book using lm, but have to conclude I don't understand what lm does in 
this context, even though I end up at the desired result. I'm currently 
using R 2.15.2 on a recent Fedora system, but I get the same result on 
Debian Wheezy and Debian Squeeze. I think the discussion below can be 
followed without having the book at hand though.

I'm working with tables 5.2 and 5.5 in the above-mentioned book. Table
5.2 contains data from the "Leaf spring experiment". The dataset is also 
in this zip file:

ftp://ftp.wiley.com/public/sci_tech_med/experiments-planning/data%20sets.zip

I've learned from the book that the effects can be found by fitting a
linear model and doubling the coefficients. So, I do

 > leaf <- read.table("/ifi/bifrost/a03/kjekje/fag/experimental-planning/book-datasets/LeafSpring table 5.2.dat",
                      col.names=c("B", "C", "D", "E", "Q", paste("r", 1:3, sep=""), "yavg", "ssq", "lnssq"))
 > leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)
 > leaf.lm

Call:
lm(formula = yavg ~ B * C * D * E * Q, data = leaf)

Coefficients:
    (Intercept)              B+              C+              D+              E+
        7.54000         0.07003         0.32333        -0.09668         0.07668
             Q+           B+:C+           B+:D+           C+:D+           B+:E+
       -0.33670         0.01335         0.11995         0.02335              NA
          C+:E+           D+:E+           B+:Q+           C+:Q+           D+:Q+
             NA              NA         0.22915        -0.25745         0.28255
          E+:Q+        B+:C+:D+        B+:C+:E+        B+:D+:E+        C+:D+:E+
        0.05415              NA              NA              NA              NA
       B+:C+:Q+        B+:D+:Q+        C+:D+:Q+        B+:E+:Q+        C+:E+:Q+
        0.04160        -0.16160        -0.18840              NA              NA
       D+:E+:Q+     B+:C+:D+:E+     B+:C+:D+:Q+     B+:C+:E+:Q+     B+:D+:E+:Q+
             NA              NA              NA              NA              NA
    C+:D+:E+:Q+  B+:C+:D+:E+:Q+
             NA              NA

(seems there is little I can do about the line breaks here, sorry)

However, the book (table 5.5) has 0.221 for the main effect of B and
0.176 for C, and the above is neither this nor half of it. Now, I can
reproduce what's in the book with

 > lm(yavg ~ B, data=leaf)

Call:
lm(formula = yavg ~ B, data = leaf)

Coefficients:
(Intercept)           B+
      7.5254       0.2213

 > lm(yavg ~ C, data=leaf)

Call:
lm(formula = yavg ~ C, data = leaf)

Coefficients:
(Intercept)           C+
      7.5479       0.1763

It seems, then, that lm does in fact double the coefficient in this case.
But here the intercept varies between the two fits, which doesn't seem
correct, nor can I find the interactions as easily the same way.
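
(A quick sanity check, assuming the leaf data frame above: under R's default
treatment contrasts, the "B+" coefficient is simply the difference between the
two level means, i.e. already the full effect rather than half of it.)

 > with(leaf, mean(yavg[B == "+"]) - mean(yavg[B == "-"]))   # 0.2213, matching the B+ coefficient above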

Now, I try the effects() function, and get familiar numbers:
 > effects(leaf.lm)
(Intercept)          B+          C+          D+          E+          Q+
   -30.54415    -0.44250     0.35250    -0.05750    -0.20750    -0.51920
       B+:C+       B+:D+       C+:D+       B+:Q+       C+:Q+       D+:Q+
    -0.03415    -0.03915     0.07085    -0.16915     0.33085    -0.10755
       E+:Q+    B+:C+:Q+    B+:D+:Q+    C+:D+:Q+
     0.05415    -0.02080     0.08080    -0.09420

and indeed, I have verified that effects(leaf.lm)/2 gives me the 
expected result.

So, I have found the correct answer, but I don't understand why. I have
read the documentation for effects() as well as the relevant chapter in
"Statistical Models in S", but all I got from those was a hint in the
phrase "the effects are the uncorrelated single-degree-of-freedom
values", which suggests the effects are somewhat different from the
coefficients. Still, I can't make out from the book (Wu & Hamada) why
the coefficients should be any different from the effects; on the
contrary, it is quite clear from equation (5.8) in the book that the
coefficients they use are effects(leaf.lm)/4.

So, there are at least two points of confusion here: one is how coef()
differs from effects() in the case of fractional factorial experiments,
and the other is the factor of 1/4 between the coefficients used by Wu &
Hamada and the values returned by effects(), where from the theory I've
read I would have expected a factor of 2.

Best regards,

Kjetil
#
Kjetil Kjernsmo <kjekje <at> ifi.uio.no> writes:
Just a quick thought (sorry for removing context): what happens if
you use sum-to-zero contrasts throughout, i.e. options(contrasts=c("contr.sum",
"contr.poly")) ... ?
#
Hi,
On Wed, Mar 6, 2013 at 5:46 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:
> I think the discussion below can be followed without having the book
> at hand though.

I have my doubts...
> leaf.lm <- lm(yavg ~ B * C * D * E * Q, data=leaf)

That is complete nonsense:

 > dim(leaf)
[1] 16 11
 > length(coef(leaf.lm))
[1] 32

So you are trying to estimate 32 coefficients from 16 data points.
That is never going to work.
> (seems there is little I can do about the line breaks here, sorry)

I have no idea what this means.

> But here the intercept varies between the two fits, which doesn't seem
> correct
You mean that the intercept for

lm(yavg ~ B, data=leaf)

differs from the intercept for

lm(yavg ~ C, data=leaf)

? If so, that is expected. The intercept is the expected value of yavg
when all predictors are zero. The expected value for B = 0 does not
have to be the same as the expected value for C = 0.
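
(Concretely, with the leaf data above: under the default treatment contrasts
the intercept of each one-factor fit is just the mean of that factor's
baseline "-" group, and those group means differ.)

 > mean(leaf$yavg[leaf$B == "-"])   # 7.5254, the intercept of lm(yavg ~ B)
 > mean(leaf$yavg[leaf$C == "-"])   # 7.5479, the intercept of lm(yavg ~ C)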

> nor can I find the interactions as easily the same way
What way?
I don't know the answers to these questions (I don't really understand
the effects function). So I'm afraid all I've done is add more
questions, i.e., why in the world are we estimating 32 coefficients
from 16 points?

Best,
Ista
#
On 03/06/2013 02:50 PM, Ben Bolker wrote:
That works (except for the sign)! What would this mean?

Kjetil
#
As Ista indicates, the basic issue is that the OP does not understand
linear modeling and is therefore just thrashing around with lm. For
example, the statement about effects being double the coefficients is
only true with the orthogonal (-1, 1) parameterization of the contrasts.
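
(A minimal illustration of the two codings, using only base R:)

 > contr.treatment(2)   # codes a two-level factor as 0/1: the coefficient is the difference between level means
 > contr.sum(2)         # codes it as +1/-1: the coefficient is half that difference (up to sign)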

So I suggest the OP either find some local statistical help or start
reading up on linear models, rather than wasting further time and
space here.

-- Bert
#
On Mar 6, 2013, at 4:46 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:

I'll ignore the rest of your questions, in the hope that this will answer them sufficiently.

You probably want a simple linear model, specified in R using "+" instead of "*".
Call:
lm(formula = yavg ~ B + C + D + E + Q, data = leaf)

Coefficients:
(Intercept)           B+           C+           D+           E+           Q+  
    7.50084      0.22125      0.17625      0.02875      0.10375     -0.25960  

Does this give you the numbers you expect?

Peter
#
On 03/06/2013 04:18 PM, Peter Claussen wrote:
OK!
Well, it partly gives the numbers I expect, but I want the interactions 
as well, so it is only a partial answer.

Best,

Kjetil
#
On Mar 6, 2013, at 9:23 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:

But you don't have enough data points to estimate all of the possible interactions; that's why you have NA in your original results. You could add just the first-order interactions manually, i.e., + B:C + B:D ?
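
(A sketch of that suggestion, assuming the leaf data frame from the first
message and adding only the two-factor interactions that came out non-aliased
in the full fit above:)

 > lm(yavg ~ B + C + D + E + Q + B:C + B:D + C:D + B:Q + C:Q + D:Q + E:Q, data=leaf)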

Peter
#
On Wednesday 6. March 2013 16.33.34 Peter Claussen wrote:
Yes, but it seems to me that lm is doing the right thing, or at least the
expected thing, here; the NAs are simply telling me that these effects are
aliased, which is correct and expected.

> You could add just the first-order interactions manually

Yeah, I tried that, but then it returns to the "unexpected" result, i.e., I
get the same result as with the yavg ~ B * C * D * E * Q formula. Therefore, I
think the problem lies neither with the formula nor with any of the code; it
is just a matter of understanding defaults...

I have consulted local help (of course), but what they say is that "R has
some odd defaults, you need to ask them or use something different". I don't
want to use something different; I like R, I have contributed to R in the
past, and I will do so again if only I can get my head around this... :-)

Kjetil
#
On Thu, Mar 7, 2013 at 5:47 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:
I agree lm is doing the correct thing, but I think you are not. It
simply does not make sense:

# your model tries to estimate means for all 32 of these groups
pm <- c("+", "-")
tmp <- expand.grid(B=pm, C=pm, D=pm, E=pm, Q=pm)
dim(tmp)

# but your data has only 16 of them!
interaction(tmp) %in% interaction(leaf[c("B", "C", "D", "E", "Q")])

# your model asks for estimates for these groups, but they do not exist
# in your data
tmp[!interaction(tmp) %in% interaction(leaf[c("B", "C", "D", "E", "Q")]), ]
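
(Run as-is, dim(tmp) gives 32 rows and 5 columns, the %in% test is TRUE for
only 16 of the 32 rows, and the last line lists the 16 level combinations
that this half-fraction design never ran.)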
Start again, and forget the book, which most of us do not have. Explain
what you are trying to do. Base the explanation on the data and
example at hand, and explain what you consider to be the expected
result and why.

Best,
Ista
#
Some observations to throw into the mix here.

First, the model

 yavg ~ B * C * D * E * Q

is indeed unnecessarily complicated. You can request all second-order interactions (accepting that some will be NA) with

 leaf.lm <- lm(yavg ~ (B + C + D + E + Q)^2, data=leaf)

or of course you can specify only the ones you know will be present:

 leaf.lm <- lm(yavg ~ B + C + D + E + Q + B:C + B:D + C:D + B:Q + C:Q + D:Q + E:Q, data=leaf)

I cheated a bit:

 leaf.avg <- leaf[c("B", "C", "D", "E", "Q", "yavg")]
 leaf.lm <- lm(yavg ~ .^2, data=leaf.avg)   #less typing, same model (with NAs for aliased interactions)

Now let's look at those effects.

First, look at the model matrix for leaf.lm:

 model.matrix(leaf.lm)

This shows you what your yavg is actually being regressed against - it's a matrix of dummy codings for your (character) levels "+" and "-".

Notice the values for B (under "B+") and the intercept. The first row (which has "-" for B in the leaf data set) has a 0; the second has a 1. So in this matrix, 1 corresponds to "+" and 0 corresponds to "-", which you can read as "not different from the intercept". This comes from the way the contrasts have been chosen: these are 'treatment contrasts', R's default for factors. Pretty much the same goes for all the other factors.

This structure corresponds to measuring the 'alternate' level of every factor - always "+", because the first level is "-" - against the intercept. What the intercept represents is not immediately obvious if you're used to seeing the mean of the study as the intercept; here it estimates the response when all factors are in their 'zero' state. That's really useful if you want to test a series of alternate treatments against a control, and it's R's default. But it's not what most industrial experiments based on two-level fractional factorials want; they usually want differences between levels. That is why some folk would call R's defaults 'odd defaults'.

To get something closer to the usual fractional factorial usage, you need to change those dummy codings from (0, 1) for ("-", "+") to something like (+1, -1). You can do that using contrasts. To change the contrasts to something most industrial experimenters would expect, we use sum-to-zero contrasts. So we do this instead (using my leaf.avg data frame above):

 options(contrasts=c("contr.sum", "contr.poly"))
 leaf.lm <- lm(yavg ~ .^2, data=leaf.avg)
 coef(leaf.lm)
Now we can see that the coefficient for B is indeed half the difference between levels, as you might have expected. And the intercept is now equal to the mean of the data, as you might expect from much of the industrial experiment design literature. 
So - why half, and why is it negative?

Look at the new model matrix:

 model.matrix(leaf.lm)

This time, B is full of +1 and -1, instead of 0 and 1. So first, B+ and B- are both different (in equal and opposite directions) from the intercept.

Second, the range allocated to B is (+1 - -1) = 2, so the change in yavg per _unit_ change in B (the lm coefficient for B) is half the difference between the levels of B.

Finally, look at the particular allocation of model matrix values for B. The first row has +1, the second row -1. That's because contr.sum has allocated +1 to the first level of the factor B, which (if you look at levels(leaf$B)) is "-", because factor() uses a default alphabetic order. (You could change that for all your factors if you wanted, for example with leaf$B <- factor(leaf$B, levels=c("+","-")), and you'd then have +1 for "+" and -1 for "-".) In the meantime, though, lm has been told that "-" corresponds to an _increase_ in the dummy coding for B. Since the mean for B == "-" is lower than the intercept, you get a negative coefficient.
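
(A numeric check of the sign and the factor of a half, assuming the same leaf
data; 0.22125 is the B+ coefficient from the additive fit earlier in the
thread:)

 > with(leaf, (mean(yavg[B == "-"]) - mean(yavg[B == "+"])) / 2)   # -0.110625, the contr.sum coefficient for B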

effects() is doing something quite different; the help page tells you what it is doing mathematically. It has an indirect physical interpretation: for simple designs like this, the effects() output for the coefficients comes out as each coefficient multiplied by sqrt(n), where n is the number of observations in the experiment. But it is not giving you the effect of a factor in the sense of the effect of a change in factor level on the mean response.
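
(A quick check of that sqrt(n) relationship, assuming the contr.sum fit above;
n = 16 runs, and under contr.sum the column for B is labelled "B1":)

 > sqrt(nrow(leaf)) * coef(leaf.lm)["B1"]   # 4 * -0.110625 = -0.4425, the B entry of effects()

This also accounts for the factor of 4 in the original question: Wu & Hamada's
equation (5.8) coefficients are half-effects, and effects() here returns
sqrt(16) = 4 times those half-effects.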

Hope some of that is useful to you.

S Ellison

#
On Mar 7, 2013, at 4:47 AM, Kjetil Kjernsmo <kjekje at ifi.uio.no> wrote:

I ran across the text you reference while looking up a couple of other references yesterday; having read the appropriate section, I can better understand your questions.

I would say in this case that what R is doing is expected, but not necessarily correct, for this problem. 

The authors go into some detail about the aliasing of treatments and note that there are different choices of aliasing. I didn't have my computer handy, but does R choose the same set of aliases as the authors?

Peter