
Pre-model Variable Reduction

10 messages · Harsh, Hans W Borchers, Gabor Grothendieck +3 more

#
Hello All,
I am trying to carry out variable reduction. I have no information
about the dependent variable, only the X variables.
In selecting the variables to keep, I have considered the following criteria:
1) Percentage of missing values in each column/variable
2) Variance of each variable, with a cut-off value.
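
A minimal base-R sketch of the two screening criteria above. The helper name
screen_columns, the cut-off values and the toy data are assumptions made for
illustration; they do not come from the thread. Note that a raw variance
cut-off depends on each variable's scale, and that factors are screened here
on missingness only.

```r
# Hypothetical helper: keep columns that pass a missing-value screen
# and (for numeric columns) a variance screen.
screen_columns <- function(df, max_missing = 0.3, min_variance = 1e-8) {
  # Fraction of NA values in each column
  frac_missing <- colMeans(is.na(df))
  # Variance for numeric columns only; non-numeric columns get NA
  col_variance <- vapply(df, function(col) {
    if (is.numeric(col)) var(col, na.rm = TRUE) else NA_real_
  }, numeric(1))
  # A column is kept if it is not too sparse and, when numeric,
  # its variance reaches the cut-off
  keep <- frac_missing <= max_missing &
    (is.na(col_variance) | col_variance >= min_variance)
  names(df)[keep]
}

# Toy data (assumed): one ordinary column, one constant column,
# one mostly-missing column and one factor
set.seed(1)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rep(5, 100),                 # constant, so zero variance
  x3 = c(rep(NA, 60), rnorm(40)),   # 60% missing
  x4 = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)
kept <- screen_columns(toy)
print(kept)
```

For numeric variables on very different scales, a coefficient of variation or
a variance computed after rescaling may be a more comparable cut-off.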

I recently came across Weka and found that there is an RWeka package
which would allow me to make use of Weka through R.
Weka provides a "Genetic search" variable reduction method, but I
could not find an R interface to it in the RWeka reference manual (PDF)
on CRAN.

I looked for other R packages that allow me to do variable reduction
without considering a dependent variable. I came across the 'dprep'
package, but it does not have a Windows build.

Moreover, my dataset contains both continuous and categorical
variables. The categorical variables have 3 levels, 10 levels and
so on, up to a maximum of 50 levels (e.g. states in the USA).

Any suggestions in this regard will be much appreciated.

Thank you

Harsh Singhal
Decision Systems,
Mu Sigma, Inc.
#
Harsh <singhalblr <at> gmail.com> writes:
I doubt that you will find what you are looking for, but there is a Windows
version available at the "Homepage of the dprep package" at
<http://math.uprm.edu/~edgar/dprep.html>.
This version 2.0 can be loaded without errors into R 2.8.0, though it appears
not to be fully compliant with the tests on CRAN.
#
Hi Harsh,
Have a look at the subselect package. It has an implementation of a genetic
algorithm, along with some other methods. It should do what you want.

Regards, Mark.
#
See:

?prcomp
?princomp
On Tue, Dec 9, 2008 at 5:34 AM, Harsh <singhalblr at gmail.com> wrote:
#
Harsh wrote:
Take a look at the redun function in the Hmisc package, which does
redundancy analysis.

Frank
#
Principal components analysis does "dimensionality reduction" but NOT
"variable reduction": each component is a linear combination of all the
original variables. However, Jolliffe's 2004 book on PCA does discuss the
problem of selecting a subset of variables, with the goal of representing
the internal variation of the original multivariate vector as well as possible
(see Section 6.3 of that book).  I do not think that these methods can
handle missing data.  The most important issue is to think about the goal of
variable reduction and then choose an appropriate optimality criterion for
achieving that goal.  In most instances of variable selection, the criterion
that is optimized is never explicitly considered.
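
To illustrate the distinction between dimensionality reduction and variable
reduction, here is a small sketch using prcomp on simulated data. The data and
variable names are assumptions made for illustration only. The point is that
the leading component still has a non-zero loading on every original
variable, so all of them must still be measured.

```r
# Simulated data (assumed): a and b are nearly collinear, c and d are noise
set.seed(42)
n <- 200
z <- rnorm(n)
X <- data.frame(
  a = z + rnorm(n, sd = 0.1),
  b = z + rnorm(n, sd = 0.1),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on standardized variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Each column of the rotation matrix holds one component's loadings
print(round(pca$rotation, 2))

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The first component uses every original variable (no loading is exactly 0)
uses_all <- all(pca$rotation[, 1] != 0)
print(uses_all)
```

Keeping only the first two or three components reduces the dimension of the
data, but computing their scores still requires a, b, c and d.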

Ravi.

-----------------------------------------------------------------------------------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
-----------------------------------------------------------------------------------


#
Thank you everyone.
The idea really is for me to select a subset of the variables themselves
from the super-set of all variables. For example, given:
x1 - numeric continuous
x2 - numeric continuous
x3 - numeric factor with 2 levels
x4 - character factor with 10 levels
x5 - numeric continuous
x6 - numeric integer

the variable reduction method should ideally give me something like:

keep : x1, x3 and x6
drop : x2, x4 and x5

The 'redun' function from the Hmisc package seems promising since it
considers categorical variables as well. A variable is dropped if it
can be predicted from the other variables. I guess it is a check for
multicollinearity.
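
The idea of dropping a variable that the others can predict can be sketched in
base R as follows. This is NOT the Hmisc implementation of redun: it is a
simplified, assumed version that uses plain linear regression and handles
numeric columns only. The helper name drop_redundant, the R^2 cut-off of 0.9
and the toy data are all hypothetical.

```r
# Hypothetical helper: repeatedly drop the variable that is best predicted
# (highest R^2) by a linear model on the remaining variables, until no
# variable reaches the cut-off. Numeric columns only.
drop_redundant <- function(df, r2_cutoff = 0.9) {
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  dropped <- character(0)
  repeat {
    vars <- names(df)
    if (length(vars) < 2) break
    # R^2 from regressing each variable on all the others
    r2 <- vapply(vars, function(v) {
      fit <- lm(reformulate(setdiff(vars, v), response = v), data = df)
      summary(fit)$r.squared
    }, numeric(1))
    if (max(r2) < r2_cutoff) break
    worst <- names(which.max(r2))   # most predictable variable
    dropped <- c(dropped, worst)
    df[[worst]] <- NULL             # remove it and re-check the rest
  }
  list(keep = names(df), drop = dropped)
}

# Toy data (assumed): w is almost exactly u + v, k is unrelated noise
set.seed(7)
n <- 300
u <- rnorm(n)
v <- rnorm(n)
toy <- data.frame(u = u, v = v, w = u + v + rnorm(n, sd = 0.05), k = rnorm(n))

res <- drop_redundant(toy)
print(res)
```

Dropping one variable at a time and refitting matters here: once w is removed,
u and v are no longer predictable from the remaining columns, so both are kept.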

The RWeka package, as I mentioned earlier, allows one to use Weka's
variable reduction/selection techniques in R. I did come across an
implementation of the "Genetic Search" method, but have not been able
to find the documentation I would need to adapt it to my needs.

Thank you all for your time.

Harsh Singhal
Decision Systems,
Mu Sigma Inc.
On Tue, Dec 9, 2008 at 8:05 PM, Ravi Varadhan <RVaradhan at jhmi.edu> wrote:
#
Hi All,

I beg to differ with Ravi Varadhan's perspective. While it is true that
principal component analysis does not itself do variable selection, it is an
important method for pointing the way to what to select. This is what the
methods in the subselect package rely on. (One of its authors was, I believe,
a student of Jolliffe's.) For a modern perspective on this, see the
following paper:

Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani:
"Preconditioning" for feature selection and regression in high-dimensional
problems. Annals of Statistics 36(4), 2008, 1595-1618.
Summary: "We show that supervised principal components followed by a variable
selection procedure is an effective approach for variable selection in very
high dimension."

http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf

Regards, Mark.
#
Mark Difford wrote:
Mark,

Slightly more relevant are the unsupervised sparse principal component
methods described in the following references.  If anyone knows of
better references for this, please let me know.  -Frank


@Article{zou06spa,
   author = 		 {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
   title = 		 {Sparse principal component analysis},
   journal = 	 J Comp Graph Stat,
   year = 		 2006,
   volume =		 15,
   pages =		 {265-286},
   annote =		 {gene microarray;lasso/elastic net;multivariate
analysis;data reduction;singular value
decomposition;thresholding;principal components analysis that shrinks
some loadings to zero}
}
@Article{wit08tes,
   author = 		 {Witten, Daniela M. and Tibshirani, Robert},
   title = 		 {Testing significance of features by lassoed principal 
components},
   journal = 	 Annals Appl Stat,
   year = 		 2008,
   volume = 	 2,
   number = 	 3,
   pages = 	 {986-1012},
   annote = 	 {reduction in false discovery rates over using a vector of 
t-statistics;borrowing strength across genes;``one would not expect a 
single gene to be associated with the outcome, since, in practice, many 
genes work together to effect a particular phenotype.  LPC effectively 
down-weights individual genes that are associated with the outcome but 
that do not share an expression pattern with a larger group of genes, 
and instead favors large groups of genes that appear to be 
differentially-expressed.'';regress principal components on outcome}
}

  
    
#
Hi Frank,
Many thanks: I was not aware of the Witten paper. If I turn up anything else
I will be sure to let you know.

Best Regards, Mark.