Message-ID: <4F36B288.2010102@statistik.tu-dortmund.de>
Date: 2012-02-11T18:25:12Z
From: Uwe Ligges
Subject: How to properly build model matrices
In-Reply-To: <CAKxBDU974PReU9p9iZ4xUvWWy3fm-r66pLagZMHLkLEpkearfA@mail.gmail.com>
On 09.02.2012 22:39, Yang Zhang wrote:
> I always bump into a few (very minor) problems when building model
> matrices with e.g.:
>
> train = model.matrix(label~., read.csv('train.csv'))
> target = model.matrix(label~., read.csv('target.csv'))
>
> (1) The two may have different factor levels, yielding different
> matrices. I usually first rbind the data frames together to "meld"
> the factors, and then split them apart and matrixify them.
You can preprocess the data and explicitly define the same levels for
the factor variables in both of your data.frames.
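
A minimal sketch of what that could look like. The data frames and the
column name "colour" are made up for illustration and stand in for the
contents of train.csv and target.csv:

```r
# Hypothetical toy data standing in for train.csv and target.csv
train_df  <- data.frame(label = c(1, 0, 1),
                        colour = c("red", "green", "blue"))
target_df <- data.frame(label = c(0, 1),
                        colour = c("green", "red"))

# Define the full set of levels once and apply it to both data frames,
# so a level missing from one file still gets its own column
colour_levels <- c("blue", "green", "red")
train_df$colour  <- factor(train_df$colour,  levels = colour_levels)
target_df$colour <- factor(target_df$colour, levels = colour_levels)

train  <- model.matrix(label ~ ., train_df)
target <- model.matrix(label ~ ., target_df)

# Both matrices now have the same columns in the same order
same_columns <- identical(colnames(train), colnames(target))
```

This avoids the rbind-and-split step, at the cost of having to know
(or collect) the full set of levels up front.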
> (2) The target set that I'm predicting on typically doesn't have
> labels. I usually manually append dummy labels to the target data
> frame.
R cannot know labels if you do not provide any.
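
That said, the design matrix itself does not need the response. One
possible way to avoid the dummy labels, offered only as a sketch with
made-up data and column names (x, g), is to drop the response from the
terms with delete.response() before calling model.matrix():

```r
# Hypothetical training data (with labels) and target data (without)
lev <- c("a", "b", "c")
train_df  <- data.frame(label = c(1, 0, 1),
                        x = c(0.5, 1.5, 2.5),
                        g = factor(c("a", "b", "c"), levels = lev))
target_df <- data.frame(x = c(1.0, 2.0),
                        g = factor(c("b", "a"), levels = lev))

# Expand the "." against the training data, then remove the response
tt   <- terms(label ~ ., data = train_df)
tt_x <- delete.response(tt)

# Build the target matrix from the right-hand side only
target <- model.matrix(tt_x, data = target_df)
```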
> (3) I almost always remove the Intercept from the model matrices,
> since it seems to always be redundant (I usually use caret).
Then change your model formula to "label ~ . - 1". But note that the
interpretation of the coefficients changes, and the intercept is *not*
redundant in general.
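
A small illustration of both points, using made-up data (the column
names x and g are assumptions for the example only):

```r
# Hypothetical data: one numeric predictor and one three-level factor
df <- data.frame(label = c(1, 0, 1, 0),
                 x = c(0.5, 1.5, 2.5, 3.5),
                 g = factor(c("a", "b", "c", "a")))

with_int    <- model.matrix(label ~ ., df)
without_int <- model.matrix(label ~ . - 1, df)

# With a factor present, removing the intercept makes the factor use
# one indicator column per level (including the baseline "a"), so the
# number of columns stays the same but the coding is different
colnames(with_int)
colnames(without_int)

# With numeric predictors only, removing the intercept really drops a
# column: the fit is forced through the origin
num_with    <- model.matrix(label ~ x, df)
num_without <- model.matrix(label ~ x - 1, df)
```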
Uwe Ligges
> None of these is a big deal at all, but I'm just curious if I'm
> missing something simple in how I'm doing things. Thanks.
>