Hi all,
I have a pretty basic question about categorical variables, but I can't
seem to find an answer, so I am hoping someone here can help. I
found that if the factor level names are all numbers, fitting the model
in lm returns labels that are not very recognizable.
# Example: let's just assume that we want to fit this model
fit <- lm(height ~ age + Seed, data=Loblolly)
# See the category names are all mangled up here
fit
Call:
lm(formula = height ~ age + Seed, data = Loblolly)

Coefficients:
(Intercept)          age       Seed.L       Seed.Q       Seed.C       Seed^4
   -1.31240      2.59052      4.86941      0.87307      0.37894     -0.46853
     Seed^5       Seed^6       Seed^7       Seed^8       Seed^9      Seed^10
    0.55237      0.39659     -0.06507      0.35074     -0.83442      0.42085
    Seed^11      Seed^12      Seed^13
    0.53906     -0.29803     -0.77254
One possible solution I found is to rename the levels of the categorical variable:
# Prefix every level with "S"; factor() then builds a plain (unordered) factor
seed.str <- paste("S", Loblolly$Seed, sep="")
seed.str <- factor(seed.str)
fit <- lm(height ~ age + seed.str, data=Loblolly)
fit
Call:
lm(formula = height ~ age + seed.str, data = Loblolly)
Coefficients:
(Intercept) age seed.strS303 seed.strS305 seed.strS307
-0.4301 2.5905 0.8600 1.8683 -1.9183
seed.strS309 seed.strS311 seed.strS315 seed.strS319 seed.strS321
0.5350 -1.5933 -0.8867 -0.3650 -2.0350
seed.strS323 seed.strS325 seed.strS327 seed.strS329 seed.strS331
0.3067 -1.3233 -2.6400 -2.9333 -2.2267
Now it is actually possible to see which one is which, but it is kind of
lame. Can someone point me to a more elegant solution? Thank you so
much.
Saiwing Yeung
factor with numeric names
4 messages · John Fox, Tal Galili, Saiwing Yeung
Dear Saiwing Yeung,
You appear to be using orthogonal-polynomial contrasts (generated by contr.poly) for Seed, which suggests that Seed is either an ordered factor or that you've assigned these contrasts to it. Because Seed has 14 levels, you end up fitting a degree-13 polynomial. If Seed is indeed an ordered factor and you want to use contr.treatment instead, then you could, e.g., set Loblolly$Seed <- as.factor(Loblolly$Seed). (If I'm right about Seed being an ordered factor, your solution worked because it changed Seed to a factor, not because it used non-numeric level names.)
I hope this helps,
John
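To make the point about contr.poly concrete, here is a small sketch (not part of John's message) of where labels such as Seed.L and Seed^4 come from. It only uses base R and the built-in Loblolly data; the variable name poly_labels is made up for illustration.

```r
# Default contrast functions for unordered and ordered factors
getOption("contrasts")

# Seed is an ordered factor with 14 levels in the built-in Loblolly data
is.ordered(Loblolly$Seed)
nlevels(Loblolly$Seed)

# Column labels of the polynomial contrasts for a 14-level factor.
# The coefficient names in the lm() output are the variable name
# followed by these labels.
poly_labels <- colnames(contr.poly(nlevels(Loblolly$Seed)))
poly_labels
paste0("Seed", poly_labels)
```

So the "mangled" names are the linear (.L), quadratic (.Q), cubic (.C) and higher-order polynomial terms, not relabelled seed sources.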
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Saiwing Yeung
Sent: March-21-09 5:02 PM
To: r-help at r-project.org
Subject: [R] factor with numeric names
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090322/c04751bb/attachment-0002.pl>
3 days later
Thank you both so much for the answers. I think I have a better handle
on this now. Yes, Loblolly$Seed is an ordered factor, but I didn't
realize that the default contrasts for ordered factors are contr.poly.
And then I was further confused because I didn't realize that the
coefficient names generated (not just the model) are different
depending on whether there is an intercept term (even though the
contrasts setting was "contr.poly" in both cases).
> lm(formula = height ~ age + Seed, data = Loblolly)
Call:
lm(formula = height ~ age + Seed, data = Loblolly)

Coefficients:
(Intercept)          age       Seed.L       Seed.Q       Seed.C       Seed^4
   -1.31240      2.59052      4.86941      0.87307      0.37894     -0.46853
     Seed^5       Seed^6       Seed^7       Seed^8       Seed^9      Seed^10
    0.55237      0.39659     -0.06507      0.35074     -0.83442      0.42085
    Seed^11      Seed^12      Seed^13
    0.53906     -0.29803     -0.77254
> lm(formula = height ~ age + Seed - 1, data = Loblolly)
Call:
lm(formula = height ~ age + Seed - 1, data = Loblolly)

Coefficients:
    age  Seed329  Seed327  Seed325  Seed307
 2.5905  -3.3635  -3.0701  -1.7535  -2.3485
Seed331  Seed311  Seed315  Seed321  Seed319
-2.6568  -2.0235  -1.3168  -2.4651  -0.7951
Seed301  Seed323  Seed309  Seed303  Seed305
-0.4301  -0.1235   0.1049   0.4299   1.4382
This should have been obvious to me...
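A sketch of why the names differ, for anyone who hits the same confusion: as far as I understand model.matrix(), when the intercept is dropped the first factor in the formula is coded with one 0/1 indicator column per level, so its contrasts (contr.poly here) are not used for that term. The object names mm_int and mm_noint are just illustrative.

```r
# Design matrices for the two formulas compared above
mm_int   <- model.matrix(height ~ age + Seed,     data = Loblolly)
mm_noint <- model.matrix(height ~ age + Seed - 1, data = Loblolly)

# With an intercept: 13 polynomial contrast columns (Seed.L, Seed.Q, ...)
colnames(mm_int)

# Without an intercept: 14 indicator columns, one per level (Seed329, ...)
colnames(mm_noint)

# Both matrices have the same number of columns, so the fitted values agree;
# only the parameterisation (and hence the coefficient names) differs
ncol(mm_int)
ncol(mm_noint)

# The Seed columns of the no-intercept matrix contain only 0s and 1s
seed_cols <- grep("^Seed", colnames(mm_noint))
all(mm_noint[, seed_cols] %in% c(0, 1))
```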
(For the sake of completeness) I think as.factor(), like factor() with its default arguments, doesn't change the "ordered-ness":
# as.factor(Loblolly$Seed) doesn't remove the ordered-ness
> str(Loblolly$Seed)
 Ord.factor w/ 14 levels "329"<"327"<"325"<..: 10 10 10 10 10 10 13 13 13 13 ...
> str(as.factor(Loblolly$Seed))
 Ord.factor w/ 14 levels "329"<"327"<"325"<..: 10 10 10 10 10 10 13 13 13 13 ...
# this works though
> str(factor(Loblolly$Seed, ordered=FALSE))
 Factor w/ 14 levels "329","327","325",..: 10 10 10 10 10 10 13 13 13 13 ...
Saiwing
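Putting the pieces of this thread together, here is a minimal sketch of two ways to get readable Seed labels without renaming the levels. Neither was posted in the thread in this form: the first uses the contrasts argument of lm(), the second works on a copy of the data with an unordered Seed. The names fit_trt, fit_unord and lob are hypothetical.

```r
# Option 1: keep Seed ordered, but request treatment contrasts for this fit
fit_trt <- lm(height ~ age + Seed, data = Loblolly,
              contrasts = list(Seed = "contr.treatment"))
names(coef(fit_trt))

# Option 2: fit on a copy in which Seed is an unordered factor
lob <- as.data.frame(Loblolly)
lob$Seed <- factor(lob$Seed, ordered = FALSE)
fit_unord <- lm(height ~ age + Seed, data = lob)
names(coef(fit_unord))

# Both parameterise Seed relative to its first level ("329")
all.equal(coef(fit_trt), coef(fit_unord))
```

Option 1 leaves the data untouched, which may be preferable if the ordering of Seed matters elsewhere; option 2 changes the factor once so that every later model uses treatment contrasts.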