Message-ID: <4F36B288.2010102@statistik.tu-dortmund.de>
Date: 2012-02-11T18:25:12Z
From: Uwe Ligges
Subject: How to properly build model matrices
In-Reply-To: <CAKxBDU974PReU9p9iZ4xUvWWy3fm-r66pLagZMHLkLEpkearfA@mail.gmail.com>

On 09.02.2012 22:39, Yang Zhang wrote:
> I always bump into a few (very minor) problems when building model
> matrices with e.g.:
>
> train = model.matrix(label~., read.csv('train.csv'))
> target = model.matrix(label~., read.csv('target.csv'))
>
> (1) The two may have different factor levels, yielding different
> matrices.  I usually first rbind the data frames together to "meld"
> the factors, and then split them apart and matrixify them.


You can preprocess the data and explicitly define the levels of the
factor variables in both data.frames, so that both share the same levels.
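
A minimal sketch of what this could look like. The data frames and the
column name "color" are made up for illustration and stand in for the
contents of train.csv and target.csv:

```r
# Hypothetical toy data standing in for train.csv and target.csv.
train_df  <- data.frame(label = c(1, 0, 1),
                        color = c("red", "green", "blue"),
                        stringsAsFactors = FALSE)
target_df <- data.frame(label = c(0, 1),
                        color = c("red", "green"),
                        stringsAsFactors = FALSE)

# Define one common set of levels and apply it to both data frames.
color_levels <- sort(unique(c(train_df$color, target_df$color)))
train_df$color  <- factor(train_df$color,  levels = color_levels)
target_df$color <- factor(target_df$color, levels = color_levels)

train  <- model.matrix(label ~ ., train_df)
target <- model.matrix(label ~ ., target_df)

# Both matrices now have the same columns, in the same order.
identical(colnames(train), colnames(target))
```

With the levels fixed up front, a level that is absent from one file
still gets its column, so no rbind-and-split step is needed.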


> (2) The target set that I'm predicting on typically doesn't have
> labels.  I usually manually append dummy labels to the target data
> frame.

R cannot know labels if you do not provide any.
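
If the labels are only needed to satisfy the formula, one possible
workaround (a sketch, not something stated above) is to drop the response
from the terms object with delete.response() before building the matrix
for the unlabeled data. The toy data and column names below are
hypothetical:

```r
# Hypothetical toy data; the target set has no 'label' column.
train_df  <- data.frame(label = c(1, 0, 1, 0),
                        x = c(0.5, 1.2, 3.1, 2.2),
                        g = factor(c("a", "b", "a", "b")))
target_df <- data.frame(x = c(0.9, 2.0),
                        g = factor(c("b", "a"), levels = c("a", "b")))

# Expand the '.' against the training data, then drop the response.
tt   <- terms(label ~ ., data = train_df)
tt_x <- delete.response(tt)

train  <- model.matrix(tt, train_df)
target <- model.matrix(tt_x, target_df)  # no dummy labels required
```

The predictors are taken from the training data's terms, so the target
data frame needs the same predictor columns but no label column.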

> (3) I almost always remove the Intercept from the model matrices,
> since it seems to always be redundant (I usually use caret).

Then change your model formula to: "label ~ . - 1". But note that the
interpretation of the coefficients changes, and the intercept is *not*
redundant in general.
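
A small illustration of why the interpretation changes, using made-up
data (the names "f" and "x" are hypothetical): without the intercept, the
first factor is coded with one indicator column per level instead of
contrasts against a baseline level.

```r
# Hypothetical toy data with one factor and one numeric predictor.
df <- data.frame(label = c(1, 0, 1, 0, 1, 0),
                 f = factor(c("a", "b", "c", "a", "b", "c")),
                 x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7))

with_int <- model.matrix(label ~ ., df)
no_int   <- model.matrix(label ~ . - 1, df)

colnames(with_int)  # "(Intercept)" "fb" "fc" "x"
colnames(no_int)    # "fa" "fb" "fc" "x"
```

Both matrices have four columns here: dropping the intercept did not
remove a redundant column, it replaced the baseline-plus-contrasts coding
of "f" with a full set of level indicators.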

Uwe Ligges


> None of these is a big deal at all, but I'm just curious if I'm
> missing something simple in how I'm doing things.  Thanks.
>