Stepwise SVM Variable selection - R-help

Thu, Jan 6, 2011 11:10 PM #

I have a data set with about 30,000 training cases and 103 variables.

I've trained an SVM (using the e1071 package) for a binary classifier 
{0,1}.  The accuracy isn't great.

I used a grid search over the C (cost) and G (gamma) parameters with an 
RBF kernel to find the best settings.
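
For readers following along, a grid search of this kind can be sketched with tune.svm() from e1071. This is only an illustration: the synthetic data, the variable names, the grid values, and the 5-fold setting are all assumptions and are not taken from the original post.

```r
library(e1071)  # provides svm(), tune.svm() and tune.control()

set.seed(42)

# Synthetic stand-in data (assumption): 4 predictors, binary {0,1} response.
n <- 200
x <- matrix(rnorm(n * 4), ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
y <- factor(as.integer(x[, 1] - x[, 2] + rnorm(n, sd = 0.5) > 0))

# Cross-validated grid search over cost (C) and gamma (G), RBF kernel.
# The grid here is 4 x 4 to keep the example fast; the post used 10 x 10.
grid <- tune.svm(x, y,
                 kernel = "radial",
                 cost   = 10^(-1:2),
                 gamma  = 10^(-3:0),
                 tunecontrol = tune.control(sampling = "cross", cross = 5))

print(grid$best.parameters)   # cost/gamma pair with the lowest CV error
print(grid$best.performance)  # the corresponding CV error rate
```

The cost of this search grows with the number of grid points times the number of folds, which is why repeating it for every candidate subset of variables quickly becomes impractical.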

I remember that for least squares, R has a nice stepwise function that 
will try combining subsets of variables to find the optimal result.  
Clearly, this doesn't exist for SVMs as a built-in function.

As an experiment, I simply grabbed the first 50 variables and repeated 
the training/grid search procedure.  The results were significantly 
better.  Since the data is VERY noisy, my guess is that eliminating some 
of the variables removed some of the noise, which led to better results.

With a grid of 100 parameter settings (10 for C, 10 for G) and 106 
variables, trying every combination would be prohibitively time consuming.

Can anyone suggest an approach to seek the ideal subset of variables for 
my SVM classifier?

Thanks!

Steve Lianoglou

Thu, Jan 6, 2011 11:34 PM #

Hi,

On Fri, Jan 7, 2011 at 2:10 AM, Noah Silverman <noah at smartmediacorp.com> wrote:

Sounds like a job for the types of approaches found in the penalizedSVM package:

http://cran.r-project.org/web/packages/penalizedSVM/index.html

-steve

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Noah Silverman

Thu, Jan 6, 2011 11:52 PM #

I'll give it a try,

Thanks!

-N

On 1/6/11 11:34 PM, Steve Lianoglou wrote:

Georg Ruß

Fri, Jan 7, 2011 2:28 AM #

On 06/01/11 23:10:59, Noah Silverman wrote:

The standard feature selection stuff (backward/forward etc.) is probably
ruled out by the time it takes to compute all the sets and subsets. What
you could try is the following:

First, do a cross-validation setup: split up your data set into a training
and testing set (ratio 0.9 / 0.1 or so).

Second, train your SVM on the training set (try conservative parameters
first).

Third, have your trained SVM classify the test set and compute the
classification error.

Fourth, iterate over all variables and do the following:
  a) choose one variable and permute its values (only) in the test set
  b) have your trained SVM (from step 2) classify this test set and 
  measure the classification error
  c) repeat a) and b) many times and average the errors, so that the
  result does not hinge on a single lucky or unlucky permutation
  d) go to next variable

Fifth, you can get an impression of the importance that one variable has
by comparing the errors generated on the permuted test set for each
variable with the non-permuted test set classification error. If the
permutation of one variable drastically increases the classification
error, the variable is probably important.

Sixth, repeat the cross-validation / random sampling several times, so
that the result does not depend on one particular train/test split.

This is more like an ad-hoc approach and there are some pitfalls, but the
idea is easily explained and can also be carried over to any other
regression model with cross-validation. The computational burden in SVM is
assumed to be the training and not the prediction step and you only need a
relatively low number of training runs (sixth step) here.
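
The six steps above can be sketched in R with e1071 roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe for the original data: the synthetic data set, the helper names err_rate() and perm_importance(), the fixed cost = 1, and the repetition counts are all made up for illustration.

```r
library(e1071)  # provides svm() and its predict() method

# Misclassification rate of a fitted model on a data set.
err_rate <- function(model, X, y) {
  mean(as.character(predict(model, X)) != as.character(y))
}

# Steps 1-5 for one random split: returns, per variable, the mean increase
# in test error after permuting that variable in the test set only.
perm_importance <- function(X, y, n_perm = 30, test_frac = 0.1) {
  test_idx <- sample(nrow(X), size = round(test_frac * nrow(X)))  # step 1
  fit <- svm(x = X[-test_idx, , drop = FALSE], y = y[-test_idx],  # step 2
             kernel = "radial", cost = 1)
  X_test <- X[test_idx, , drop = FALSE]
  y_test <- y[test_idx]
  base_err <- err_rate(fit, X_test, y_test)                       # step 3
  sapply(colnames(X), function(v) {                               # step 4
    perm_err <- replicate(n_perm, {
      Xp <- X_test
      Xp[, v] <- sample(Xp[, v])    # a) permute one column only
      err_rate(fit, Xp, y_test)     # b) re-score with the same model
    })                              # c) repeated n_perm times
    mean(perm_err) - base_err       # step 5: error increase vs. baseline
  })
}

set.seed(1)

# Synthetic stand-in data (assumption): x1 and x2 carry signal,
# x3..x5 are pure noise.
n <- 1000
X <- matrix(rnorm(n * 5), ncol = 5,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- factor(as.integer(X[, "x1"] + X[, "x2"] + rnorm(n, sd = 0.5) > 0))

# Step 6: repeat over several random splits and average.
runs <- replicate(5, perm_importance(X, y))
importance <- sort(rowMeans(runs), decreasing = TRUE)
print(round(importance, 3))
```

Variables whose permutation barely changes the error are candidates for removal. Note that the SVM is trained only once per split; everything inside step four is prediction, which is the cheap part.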

Regards,
Georg.

Research Assistant
Otto-von-Guericke-Universität Magdeburg
research at georgruss.de
http://research.georgruss.de