I was surprised to find that just changing the base level of a factor variable changed the number of significant coefficients in the solution. I was surprised at this and want to know how I should choose the order of the factors, if the order affects the result. Here is the small example. It is taken from 'The R Book', Crawley p. 365. The data is at http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/competition.txt In R > comp<-read.table("C:\\Temp\\competition.txt", header=T) > attach(comp) Data has dependent variable 'biomass' and different types of 'clipping' that were done: Control (none), n25, n50, r10, r5: > summary(comp) biomass clipping Min. :415.0 control:6 1st Qu.:508.8 n25 :6 Median :568.0 n50 :6 Mean :561.8 r10 :6 3rd Qu.:631.8 r5 :6 Max. :731.0 List mean Biomass of each type of Clipping: > aggregate (comp$biomass, list (comp$clipping) , mean) Group.1 x control 465.1667 n25 553.3333 n50 569.3333 r10 610.6667 r5 610.5000 do regression - get same result as book p. 365 Clipping type 'control' is not in list of coefficients because it is first alphabetically so it is folded into Intercept > model<-lm(biomass ~ clipping) > summary(model) Call: lm(formula = biomass ~ clipping) Residuals: Min 1Q Median 3Q Max -103.333 -49.667 3.417 43.375 177.667 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 465.17 28.75 16.177 9.4e-15 *** clippingn25 88.17 40.66 2.168 0.03987 * clippingn50 104.17 40.66 2.562 0.01683 * clippingr10 145.50 40.66 3.578 0.00145 ** clippingr5 145.33 40.66 3.574 0.00147 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 70.43 on 25 degrees of freedom Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129 F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752 Relevel - make 'n25' the base level of Clipping: > comp$clipping <- relevel (comp$clipping, ref="n25") > summary(comp) biomass clipping Min. :415.0 n25 :6 1st Qu.:508.8 control:6 Median :568.0 n50 :6 Mean :561.8 r10 :6 3rd Qu.:631.8 r5 :6 Max. :731.0 Redo LM with releveled data > modelRelev<-lm(biomass~clipping, data=comp) Different results. (Some parts, Residuals and Std Errors, are the same) Especially note the Pr and Signifcance columns are different. > summary(modelRelev) Call: lm(formula = biomass ~ clipping, data = comp) Residuals: Min 1Q Median 3Q Max -103.333 -49.667 3.417 43.375 177.667 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 553.33 28.75 19.244 <2e-16 *** clippingcontrol -88.17 40.66 -2.168 0.0399 * clippingn50 16.00 40.66 0.393 0.6973 clippingr10 57.33 40.66 1.410 0.1709 clippingr5 57.17 40.66 1.406 0.1721 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 70.43 on 25 degrees of freedom Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129 F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
ordering of factor levels in regression changes result
2 messages · Tulinsky, Thomas, David Winsemius
On Feb 3, 2012, at 4:16 PM, Tulinsky, Thomas wrote:
I was surprised to find that just changing the base level of a factor variable changed the number of significant coefficients in the solution. I was surprised at this and want to know how I should choose the order of the factors, if the order affects the result.
In the first model you are getting R's default contrast between the "control" levels and each of the other levels, while in the second you are getting contrasts between N25 and the others. I would think that the most interest would be on the first set of results , but it could also be that you are not testing for what your really want. Is it scientifically interesting to consider the ordinal scale of effects? Perhaps you should be looking at a linear or quadratic fit? Looking at the text you cite, it becomes clear that you need to read the rest of the chapter before submitting questions to R-help.
Here is the small example. It is taken from 'The R Book', Crawley p. 365. The data is at http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/competition.txt In R
comp<-read.table("C:\\Temp\\competition.txt", header=T)
attach(comp)
Data has dependent variable 'biomass' and different types of 'clipping' that were done: Control (none), n25, n50, r10, r5:
summary(comp)
biomass clipping Min. :415.0 control:6 1st Qu.:508.8 n25 :6 Median :568.0 n50 :6 Mean :561.8 r10 :6 3rd Qu.:631.8 r5 :6 Max. :731.0 List mean Biomass of each type of Clipping:
aggregate (comp$biomass, list (comp$clipping) , mean)
Group.1 x
control 465.1667
n25 553.3333
n50 569.3333
r10 610.6667
r5 610.5000
do regression - get same result as book p. 365
Clipping type 'control' is not in list of coefficients because it is
first alphabetically so it is folded into Intercept
In this case there are no other covariates, so it is not so much folded into the intercept as it really IS the "Intercept".
model<-lm(biomass ~ clipping) summary(model)
Call:
lm(formula = biomass ~ clipping)
Residuals:
Min 1Q Median 3Q Max
-103.333 -49.667 3.417 43.375 177.667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 465.17 28.75 16.177 9.4e-15 ***
clippingn25 88.17 40.66 2.168 0.03987 *
clippingn50 104.17 40.66 2.562 0.01683 *
clippingr10 145.50 40.66 3.578 0.00145 **
clippingr5 145.33 40.66 3.574 0.00147 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.43 on 25 degrees of freedom
Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
Relevel - make 'n25' the base level of Clipping:
comp$clipping <- relevel (comp$clipping, ref="n25")
summary(comp)
biomass clipping Min. :415.0 n25 :6 1st Qu.:508.8 control:6 Median :568.0 n50 :6 Mean :561.8 r10 :6 3rd Qu.:631.8 r5 :6 Max. :731.0 Redo LM with releveled data
modelRelev<-lm(biomass~clipping, data=comp)
Different results. (Some parts, Residuals and Std Errors, are the same) Especially note the Pr and Signifcance columns are different.
summary(modelRelev)
Call:
lm(formula = biomass ~ clipping, data = comp)
Residuals:
Min 1Q Median 3Q Max
-103.333 -49.667 3.417 43.375 177.667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 553.33 28.75 19.244 <2e-16 ***
clippingcontrol -88.17 40.66 -2.168 0.0399 *
clippingn50 16.00 40.66 0.393 0.6973
clippingr10 57.33 40.66 1.410 0.1709
clippingr5 57.17 40.66 1.406 0.1721
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.43 on 25 degrees of freedom
Multiple R-squared: 0.4077, Adjusted R-squared: 0.3129
F-statistic: 4.302 on 4 and 25 DF, p-value: 0.008752
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT