Lasso with Categorical Variables
On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
Hi,

On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2 at ncsu.edu> wrote:

Hi! This is my first time posting. I've read the general rules and guidelines, but please bear with me if I make some fatal error in posting.

Anyway, I have a continuous response and 29 predictors made up of continuous variables and nominal and ordinal categorical variables. I'd like to do lasso on these, but I get an error. The way I am using "lars" doesn't allow for the factors. Is there a special option or some other method in order to do lasso with cat. variables?

Here is an example (considering ordinal variables as just nominal):

set.seed(1)
Y <- rnorm(10,0,1)
X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
X3 <- sample(x=30:55, size=10, replace=TRUE) # think age
X4 <- rchisq(10, df=4, ncp=0)
X <- data.frame(X1,X2,X3,X4)
str(X)
'data.frame': 10 obs. of 4 variables:
 $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
 $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
 $ X3: int 51 46 50 44 43 50 30 42 49 48
 $ X4: num 2.86 1.55 1.94 2.45 2.75 ...

I'd like to do:

obj <- lars(x=X, y=Y, type = "lasso")

Instead, what I have been doing is converting all data to continuous, but I think this is really bad!
Yeah, it is. Check out the "Categorical Predictor Variables" section here for a way to handle such predictor vars: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Steve's citation is somewhat helpful, but not sufficient to take the
next steps. You can find details regarding the mechanics of typical
linear regression in R on the ?lm page where you find that the factor
variables are typically handled by model.matrix. See below:
> model.matrix(~X1 + X2 + X3 + X4, X)
   (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
1            1   0   0   1   0   1   0   0 51 2.8640884
2            1   0   0   0   0   0   1   0 46 1.5462243
3            1   0   1   0   0   1   0   0 50 1.9430901
4            1   0   0   0   1   0   0   0 44 2.4504180
5            1   1   0   0   0   0   0   1 43 2.7535052
6            1   1   0   0   0   0   0   1 50 1.6200326
7            1   0   0   0   0   0   0   1 30 0.5750533
8            1   1   0   0   0   0   0   0 42 5.9224777
9            1   0   0   1   0   0   0   1 49 2.0401528
10           1   1   0   0   0   1   0   0 48 6.2995288
attr(,"assign")
[1] 0 1 1 1 2 2 2 2 3 4
attr(,"contrasts")
attr(,"contrasts")$X1
[1] "contr.treatment"
attr(,"contrasts")$X2
[1] "contr.treatment"
The numeric variables are passed through, while the dummy variables
for factor columns are constructed (as treatment contrasts) and the
whole thing is returned in a neat package.
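
As a minimal sketch of the next step (not something tested in this thread), the matrix from model.matrix can be handed to lars() in place of the data frame. The sketch assumes the lars package is installed and that its default of fitting its own intercept is wanted, so the "(Intercept)" column is dropped first. The names mm and x are made up for illustration, and because the data are regenerated from the seed, the values may not match the output shown above on a different R version.

```r
# Rebuild the example data from the original post.
set.seed(1)
Y  <- rnorm(10, 0, 1)
X1 <- factor(sample(x = LETTERS[1:4],  size = 10, replace = TRUE))
X2 <- factor(sample(x = LETTERS[5:10], size = 10, replace = TRUE))
X3 <- sample(x = 30:55, size = 10, replace = TRUE)  # think age
X4 <- rchisq(10, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand the factors into treatment-contrast dummy columns;
# numeric columns are passed through unchanged.
mm <- model.matrix(~ X1 + X2 + X3 + X4, data = X)

# Drop the "(Intercept)" column (assumption: lars() adds its own intercept).
x <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

# Fit the lasso path only if the lars package is available.
if (requireNamespace("lars", quietly = TRUE)) {
  obj <- lars::lars(x = x, y = Y, type = "lasso")
  print(obj)
}
```

Note that in this setup each dummy column is treated as a separate predictor, so individual levels of a factor can enter or leave the model on their own.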
--
David.
HTH, -steve
David Winsemius, MD
Heritage Laboratories
West Hartford, CT