
Cross-validation for parameter selection (glm/logit)

4 messages · John, JLucke at ria.buffalo.edu, Steve Lianoglou +1 more

#
My aim is to select a good subset of parameters for my final logit
model built using glm(). What is the best way to cross-validate the
results so that they are reliable?

Let's say that I have a large dataset of thousands of observations. I
split this data into two groups, one that I use for training and
another for validation. First I use the training set to build a model,
and then stepAIC() with a forward-backward search. BUT, if I base
my parameter selection purely on this result, I suppose it will be
somewhat skewed due to the one-time data split (I use only one
training dataset).

What is the correct way to perform this variable selection? And are
there readily available packages for this?
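The repeated-split idea can be sketched in base R plus MASS (which ships with R and provides stepAIC()). The data below are simulated stand-ins for the real dataset, and the 20 repeats are an arbitrary illustrative choice; the point is to count how often each variable survives selection across splits rather than trusting a single split:

```r
library(MASS)  # for stepAIC()

set.seed(1)
n <- 1000
# simulated stand-in data: x1 and x2 carry signal, x3 is pure noise
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, 1, plogis(x1 - 0.5 * x2))
dat <- data.frame(y, x1, x2, x3)

selected <- character(0)
for (i in 1:20) {
  idx <- sample(n, n / 2)                 # fresh random training half each time
  fit <- glm(y ~ ., data = dat[idx, ], family = binomial)
  step_fit <- stepAIC(fit, direction = "both", trace = FALSE)
  # record which variables the forward-backward search kept
  selected <- c(selected, attr(terms(step_fit), "term.labels"))
}
freq <- table(selected) / 20              # selection frequency per variable
print(freq)
```

Variables with selection frequency near 1 across splits are the stable ones; a variable kept only occasionally is likely an artifact of a particular split.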

Similarly, when I have my final parameter set, how should I go about
making the final assessment of the model's predictive performance? CV?
Which package?
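For that final assessment, a plain k-fold CV loop needs nothing beyond base R (cv.glm() in the boot package automates the same idea). The data, the 10-fold choice, and the 0.5 classification threshold below are illustrative stand-ins:

```r
set.seed(1)
n <- 1000
# simulated stand-in data for illustration
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(x1 - 0.5 * x2))
dat <- data.frame(y, x1, x2)

k <- 10
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
acc <- numeric(k)
for (i in 1:k) {
  # fit the final model on k-1 folds, predict on the held-out fold
  fit <- glm(y ~ x1 + x2, data = dat[folds != i, ], family = binomial)
  p <- predict(fit, newdata = dat[folds == i, ], type = "response")
  acc[i] <- mean((p > 0.5) == dat$y[folds == i])
}
mean(acc)  # cross-validated classification accuracy
```

Note that for an honest estimate the variable selection itself must sit inside the CV loop; running CV only on the already-selected model understates the error.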


Thank you in advance,
Jay
#
Hi,
On Fri, Apr 2, 2010 at 9:14 AM, Jay <josip.2000 at gmail.com> wrote:
Another approach would be to use penalized regression models.

The glmnet package has lasso and elastic-net models for both logistic
and "normal" regression models.

Intuitively: in addition to minimizing (say) the squared loss, the
model has to pay some cost (lambda) for including a non-zero parameter
in your model, which in turn provides sparse models.

You can use CV to fine-tune the value of lambda.

If you're not familiar with these penalized models, the glmnet package
has a few references to get you started.
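A minimal sketch of that workflow, assuming the glmnet package is installed; the data are simulated, and alpha = 1 (pure lasso) is just one point on the elastic-net spectrum:

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 10
# simulated design matrix; only the first two columns carry signal
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2]))

# cv.glmnet tunes lambda by cross-validation; alpha = 1 gives the lasso
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# sparse coefficient vector at the CV-chosen lambda:
# noise columns are shrunk exactly to zero
coef(cvfit, s = "lambda.min")
```

The sparsity pattern at lambda.min is the "variable selection": columns with zero coefficients have been priced out of the model.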

-steve
#
Inline below: 

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Steve Lianoglou
Sent: Friday, April 02, 2010 2:34 PM
To: Jay
Cc: r-help at r-project.org
Subject: Re: [R] Cross-validation for parameter selection (glm/logit)

Hi,
On Fri, Apr 2, 2010 at 9:14 AM, Jay <josip.2000 at gmail.com> wrote:
-- Define "good"


What is the best way to cross-validate the results so that they are reliable?

-- Define "best"
-- Define "reliable"

Answers depend on what you mean by these terms. I suggest you consult a
statistician to work with you. These are huge issues for which you would
profit from some guidance.

Cheers,
Bert