Skip to content

ordering of factor levels in regression changes result

2 messages · Tulinsky, Thomas, David Winsemius

#
I was surprised to find that just changing the base level of a factor variable changed the number of significant coefficients in the solution.

I was surprised at this and want to know how I should choose the order of the factors, if the order affects the result.

Here is the small example. It is taken from 'The R Book', Crawley  p. 365. 

The data is at 
http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/competition.txt

In R

   > comp<-read.table("C:\\Temp\\competition.txt", header=T)

   > attach(comp)

Data has dependent variable 'biomass' and different types of 'clipping' that were done:
Control (none), n25, n50, r10, r5:

   >  summary(comp)
       biomass         clipping
   Min.   :415.0   control:6  
    1st Qu.:508.8   n25    :6  
    Median :568.0   n50    :6  
    Mean   :561.8   r10    :6  
    3rd Qu.:631.8   r5     :6  
    Max.   :731.0              

List mean Biomass of each type of Clipping:

   > aggregate (comp$biomass, list (comp$clipping) , mean)
     Group.1        x
    control 465.1667
        n25 553.3333
        n50 569.3333
        r10 610.6667
         r5 610.5000

do regression - get same result as book p. 365
Clipping type 'control' is not in list of coefficients because it is first alphabetically so it is folded into Intercept

   > model<-lm(biomass ~ clipping)
   > summary(model)

   Call:
   lm(formula = biomass ~ clipping)

   Residuals:
        Min       1Q   Median       3Q      Max 
   -103.333  -49.667    3.417   43.375  177.667 

   Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
   (Intercept)   465.17      28.75  16.177  9.4e-15 ***
   clippingn25    88.17      40.66   2.168  0.03987 *  
   clippingn50   104.17      40.66   2.562  0.01683 *  
   clippingr10   145.50      40.66   3.578  0.00145 ** 
   clippingr5    145.33      40.66   3.574  0.00147 ** 
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

   Residual standard error: 70.43 on 25 degrees of freedom
   Multiple R-squared: 0.4077,     Adjusted R-squared: 0.3129 
   F-statistic: 4.302 on 4 and 25 DF,  p-value: 0.008752 


Relevel - make 'n25' the base level of Clipping:

   > comp$clipping <- relevel (comp$clipping, ref="n25")

   > summary(comp)
       biomass         clipping
    Min.   :415.0   n25    :6  
    1st Qu.:508.8   control:6  
    Median :568.0   n50    :6  
    Mean   :561.8   r10    :6  
    3rd Qu.:631.8   r5     :6  
    Max.   :731.0              

Redo LM with releveled data

   > modelRelev<-lm(biomass~clipping, data=comp)

Different results. (Some parts, Residuals and Std Errors,  are the same)
Especially note the Pr and Signifcance columns are different.

   > summary(modelRelev)

   Call:
   lm(formula = biomass ~ clipping, data = comp)

   Residuals:  
        Min       1Q   Median       3Q      Max 
   -103.333  -49.667    3.417   43.375  177.667 

   Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       553.33      28.75  19.244   <2e-16 ***
   clippingcontrol   -88.17      40.66  -2.168   0.0399 *  
   clippingn50        16.00      40.66   0.393   0.6973    
   clippingr10        57.33      40.66   1.410   0.1709    
   clippingr5         57.17      40.66   1.406   0.1721    
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

   Residual standard error: 70.43 on 25 degrees of freedom
   Multiple R-squared: 0.4077,     Adjusted R-squared: 0.3129 
   F-statistic: 4.302 on 4 and 25 DF,  p-value: 0.008752
#
On Feb 3, 2012, at 4:16 PM, Tulinsky, Thomas wrote:

            
In the first model you are getting R's default contrast between the  
"control" levels and each of the other levels, while in the second you  
are getting contrasts between N25 and the others. I would think that  
the most interest would be on the first set of results , but it could  
also be that you are not testing for what your really want. Is it  
scientifically interesting to consider the ordinal scale of effects?  
Perhaps you should be looking at a linear or quadratic fit?

Looking at the text you cite, it becomes clear that you need to read  
the rest of the chapter before submitting questions to R-help.
In this case there are no other covariates,  so it is not so much  
folded into the intercept as it really IS the "Intercept".
David Winsemius, MD
West Hartford, CT