Lasso with Categorical Variables
Thanks for your response, but I guess I didn't make my question clear. I am already familiar with the concept of dummy variables and regression in R. My question is, can the "lars" package (or some other lasso algorithm) handle factors? I did use dummy variables in my original data, but lars (lasso) only shrank the coefficients of some of the levels of one factor to 0. Is this the correct thing to do? Because intuitively it seems like I would want to shrink the whole factor coefficient to 0. If this is correct, what is the interpretation? For example, for X1, if lasso drops the coefficient for levels A and B, but not C and D, does this mean that X1 should be included in the model? Thanks.
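One way to see what the lasso has done to each factor as a whole is to group the dummy-variable coefficients by the term they came from. The sketch below is only an illustration: it assumes the lars package is installed, reuses the simulated data from the original post, picks a step along the path arbitrarily, and uses made-up helper names (term_labels, by_factor). It relies on the "assign" attribute that model.matrix attaches to its result.

```r
library(lars)  # assumption: the lars package is installed

# Simulated data from the original post
set.seed(1)
Y <- rnorm(10, 0, 1)
X <- data.frame(
  X1 = factor(sample(LETTERS[1:4],  10, replace = TRUE)),
  X2 = factor(sample(LETTERS[5:10], 10, replace = TRUE)),
  X3 = sample(30:55, 10, replace = TRUE),   # think age
  X4 = rchisq(10, df = 4, ncp = 0)
)

form <- ~ X1 + X2 + X3 + X4
mm   <- model.matrix(form, X)

# "assign" maps each column to its term (0 = intercept, 1 = X1, 2 = X2, ...)
term_id     <- attr(mm, "assign")[-1]
term_labels <- attr(terms(form), "term.labels")
Xmat        <- mm[, -1, drop = FALSE]       # drop the intercept column

fit <- lars(x = Xmat, y = Y, type = "lasso")

# coef() with no 's' returns one row of coefficients per step of the path.
# The step used here is an arbitrary choice for illustration only.
B    <- coef(fit)
beta <- B[min(3, nrow(B)), ]

# For each original variable: how many of its columns are nonzero?
grp       <- term_labels[term_id]
by_factor <- data.frame(
  term    = names(tapply(beta, grp, length)),
  nonzero = as.vector(tapply(beta != 0, grp, sum)),
  columns = as.vector(tapply(beta, grp, length))
)
print(by_factor)
```

With treatment contrasts, each dummy coefficient is a contrast against the reference level (A for X1), so a zero coefficient for one level says only that the fit does not separate that level from the reference. A factor with some nonzero and some zero coefficients is therefore still in the model, with some of its levels merged into the baseline.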
On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
Hi, On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2 at ncsu.edu> wrote:
Hi! This is my first time posting. I've read the general rules and guidelines, but please bear with me if I make some fatal error in posting. Anyway, I have a continuous response and 29 predictors made up of continuous variables and nominal and ordinal categorical variables. I'd like to do lasso on these, but I get an error. The way I am using "lars" doesn't allow for the factors. Is there a special option or some other method for doing lasso with categorical variables? Here is an example (treating ordinal variables as just nominal):

set.seed(1)
Y <- rnorm(10,0,1)
X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
X4 <- rchisq(10, df=4, ncp=0)
X <- data.frame(X1,X2,X3,X4)
str(X)
'data.frame':   10 obs. of  4 variables:
 $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
 $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
 $ X3: int  51 46 50 44 43 50 30 42 49 48
 $ X4: num  2.86 1.55 1.94 2.45 2.75 ...

I'd like to do:

obj <- lars(x=X, y=Y, type = "lasso")

Instead, what I have been doing is converting all the data to continuous, but I think this is really bad!
Yeah, it is. Check out the "Categorical Predictor Variables" section here for a way to handle such predictor vars: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Steve's citation is somewhat helpful, but not sufficient to take the next steps. You can find details regarding the mechanics of typical linear regression in R on the ?lm page where you find that the factor variables are typically handled by model.matrix. See below:
model.matrix(~X1 + X2 + X3 + X4, X)
   (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
1            1   0   0   1   0   1   0   0 51 2.8640884
2            1   0   0   0   0   0   1   0 46 1.5462243
3            1   0   1   0   0   1   0   0 50 1.9430901
4            1   0   0   0   1   0   0   0 44 2.4504180
5            1   1   0   0   0   0   0   1 43 2.7535052
6            1   1   0   0   0   0   0   1 50 1.6200326
7            1   0   0   0   0   0   0   1 30 0.5750533
8            1   1   0   0   0   0   0   0 42 5.9224777
9            1   0   0   1   0   0   0   1 49 2.0401528
10           1   1   0   0   0   1   0   0 48 6.2995288
attr(,"assign")
 [1] 0 1 1 1 2 2 2 2 3 4
attr(,"contrasts")
attr(,"contrasts")$X1
[1] "contr.treatment"

attr(,"contrasts")$X2
[1] "contr.treatment"

The numeric variables are passed through unchanged, while the dummy variables for the factor columns are constructed (as treatment contrasts), and the whole thing is returned in one neat package. -- David.
HTH, -steve
-- David Winsemius, MD Heritage Laboratories West Hartford, CT
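Putting the model.matrix suggestion together with the original lars call might look like the following. This is a minimal sketch, assuming the lars package is installed; the name Xmat is introduced here for illustration and is not from the thread. The "(Intercept)" column is dropped because lars fits its own intercept by default (intercept = TRUE).

```r
library(lars)  # assumption: the lars package is installed

# Simulated data from the original post
set.seed(1)
Y  <- rnorm(10, 0, 1)
X1 <- factor(sample(x = LETTERS[1:4],  size = 10, replace = TRUE))
X2 <- factor(sample(x = LETTERS[5:10], size = 10, replace = TRUE))
X3 <- sample(x = 30:55, size = 10, replace = TRUE)  # think age
X4 <- rchisq(10, df = 4, ncp = 0)
X  <- data.frame(X1, X2, X3, X4)

# Expand the factors into treatment-contrast dummy columns, then drop
# the "(Intercept)" column, since lars adds an intercept itself.
Xmat <- model.matrix(~ X1 + X2 + X3 + X4, X)[, -1]

# lars expects a numeric matrix for x, which Xmat now is.
obj <- lars(x = Xmat, y = Y, type = "lasso")

# One row of coefficients per step along the lasso path.
print(coef(obj))
```

Each dummy column enters or leaves the path on its own, which is why the lasso can zero out some levels of a factor while keeping others.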