Questions on factors in regression analysis
Thanks!
On Aug 20, 2009, at 1:46 PM, guox at ucalgary.ca wrote:
I have two questions on factors in regression: Q1. In a table, there are a few categorical/factor variables and a few numerical variables, and the response variable is numeric. Some factors are important but others are not. How can I determine which categorical variables are significant to the response variable?
Seems that you should engage the services of a consulting statistician for that sort of question. Or post in a venue where statistical consulting is supposed to occur, such as one of the sci.stat.* newsgroups.
I googled sci.stat.* and got sci.stat.math and sci.stat.consult. Are they good? I have no idea how to do this, so any clue will be appreciated.
Q2. As we know, lm can deal with categorical variables. I thought that, when there is a categorical predictor, we may use lm directly without quantifying these factors, and that assigning different values to the factors would not change the fit, as shown:
The "numbers" that you are attempting to assign are really just labels for the factor levels. The regression functions in R will not use them for any calculations. They should not be thought of as having "values". Even if the factor is an ordered factor, the labels may not be interpretable as having the same numerical order as the string values might suggest.
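To illustrate the point about labels, here is a minimal sketch with a made-up factor `f` (not part of the original post) whose labels happen to look like numbers. It assumes the default behaviour of `factor()`, which sorts character labels as strings.

```r
# Hypothetical factor whose labels look like numbers.
f <- factor(c("18.8", "29.9", "18.8", "100"))

levels(f)                    # labels, sorted as strings: "100" comes first
as.integer(f)                # internal integer codes, not the printed labels
as.numeric(as.character(f))  # explicit conversion is needed to get the numbers back
```

The integer codes only record which level each observation belongs to; a regression function that sees `f` uses those level memberships, never the digits in the labels.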
x <- 1:20 ## numeric predictor
yes.no <- c("yes","no")
factors <- gl(2,10,20,yes.no) ##factor predictor
factors.quant <- rep(c(18.8,29.9),c(10,10)) ## quantification of factors
Not sure what that is supposed to mean. It is not a factor object even though you may be misleading yourself into believing it should be. It's a numeric vector.
Yes, levels are not numeric but just labels. But after the factor levels are assigned numeric values, as in factors.quant and factors.quant.1, lm(response ~ x + factors.quant) and lm(response ~ x + factors.quant.1) produced the same fitted curve as lm(response ~ x + factors). This is what I could not understand.
> str(factors.quant)
num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
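One way to see the difference between the two objects is to look at the design matrix that would be built for each. The sketch below recreates the poster's `x`, `factors` and `factors.quant`; the names `mm.fact` and `mm.quant` are introduced here for illustration, and the column name `factorsno` assumes R's default treatment contrasts.

```r
x <- 1:20
factors <- gl(2, 10, 20, labels = c("yes", "no"))  # factor predictor
factors.quant <- rep(c(18.8, 29.9), c(10, 10))     # plain numeric vector

is.factor(factors)        # TRUE
is.factor(factors.quant)  # FALSE

# Design matrices for the two model formulas
mm.fact  <- model.matrix(~ x + factors)
mm.quant <- model.matrix(~ x + factors.quant)

colnames(mm.fact)                    # factor becomes an indicator column "factorsno"
unique(mm.fact[, "factorsno"])       # indicator takes only the values 0 and 1
unique(mm.quant[, "factors.quant"])  # numeric column keeps the values 18.8 and 29.9
```

The factor is expanded into an indicator column for the level "no", whereas the numeric vector enters the model as an ordinary covariate with its own values.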
factors.quant.1 <- rep(c(16.9,38.9),c(10,10)) ## second quantification of factors
response <- 0.8*x + 18 + factors.quant + rnorm(20) ## response
lm.quant <- lm(response ~ x + factors.quant) ## lm with quantifications
lm.fact <- lm(response ~ x + factors) ## lm with factors
> lm.quant
Call:
lm(formula = response ~ x + factors.quant)
Coefficients:
  (Intercept)              x  factors.quant
      14.9098         0.5385         1.2350
> lm.fact
Call:
lm(formula = response ~ x + factors)
Coefficients:
(Intercept)            x    factorsno
    38.1286       0.5385      13.7090
lm.quant.1 <- lm(response ~ x + factors.quant.1) ##lm with quantifications
> lm.quant.1
Call:
lm(formula = response ~ x + factors.quant.1)
Coefficients:
    (Intercept)                x  factors.quant.1
        27.5976           0.5385           0.6231
lm.fact.1 <- lm(response ~ x + factors) ## lm with factors
par(mfrow=c(2,2)) ## comparisons of two fittings
plot(x, response)
lines(x,fitted(lm.quant),col="blue")
grid()
plot(x,response)
lines(x,fitted(lm.fact),col = "red")
grid()
plot(x, response)
lines(x,fitted(lm.quant.1),lty =2,col="blue")
grid()
plot(x,response)
lines(x,fitted(lm.fact.1),lty =2,col = "red")
grid()
par(mfrow = c(1,1))
So, is it right that we can assign any numeric values to factors, for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above, before doing lm, glm, aov, even nls?
You can give factor levels any name you like, including any sequence of digit characters. Unlike "ordinary" R, where unquoted numbers cannot start variable names, factor functions will coerce numeric vectors to character vectors when assigning level names. But you seem to be conflating factors with numeric vectors that have many ties. R's regression functions handle those two entities differently.
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
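A short sketch may help explain why the fitted curves coincided in the two-level example and why that should not be generalized. With only two levels, the 0/1 indicator is a linear function of any two-valued numeric coding, so both models span the same space. With three or more levels this generally no longer holds. The seed, the three-level variables (`g`, `g.quant`, `x3`, `y3`) and the fit names are assumptions made for illustration, not part of the thread.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

## Two levels: factor fit and numeric-coded fit coincide
x <- 1:20
factors <- gl(2, 10, 20, labels = c("yes", "no"))
factors.quant <- rep(c(18.8, 29.9), c(10, 10))
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

fit.fact  <- lm(response ~ x + factors)
fit.quant <- lm(response ~ x + factors.quant)
all.equal(fitted(fit.fact), fitted(fit.quant))  # TRUE

# The indicator coefficient equals the numeric slope times the gap between codes
coef(fit.fact)["factorsno"]
coef(fit.quant)["factors.quant"] * (29.9 - 18.8)

## Three levels (hypothetical data): the numeric coding forces one straight-line
## effect across the codes, while the factor estimates each level separately
g <- gl(3, 10, 30, labels = c("a", "b", "c"))
g.quant <- c(1, 2, 10)[as.integer(g)]  # arbitrary numeric codes
x3 <- rep(1:10, 3)
y3 <- c(0, 5, 1)[as.integer(g)] + 0.5 * x3 + rnorm(30)

fit3.fact  <- lm(y3 ~ x3 + g)
fit3.quant <- lm(y3 ~ x3 + g.quant)
isTRUE(all.equal(fitted(fit3.fact), fitted(fit3.quant)))  # FALSE
```

The same relation can be checked against the coefficients printed in the thread: 1.2350 * (29.9 - 18.8) and 0.6231 * (38.9 - 16.9) both come out at about 13.71, matching the factorsno estimate of 13.7090.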