Pre-model Variable Reduction
Hi Frank,
If anyone knows of better references for this please let me know.
Many thanks: I was not aware of the Witten paper. If I turn up anything else I will be sure to let you know. Best Regards, Mark.
Frank E Harrell Jr wrote:
Mark Difford wrote:
Hi All,

I beg to differ with Ravi Varadhan's perspective. While it is true that principal component analysis does not itself do variable selection, it is an important method for pointing the way to what to select. This is what the methods in the subselect package rely on. (One of its authors was I believe a student of Jolliffe's).

For a modern perspective on this, see the following paper:

Debashis Paul, Eric Bair, Trevor Hastie and Robert Tibshirani: "Preconditioning" for feature selection and regression in high-dimensional problems.
We show that supervised principal components followed by a variable selection procedure is an effective approach for variable selection in very high dimension.
Annals of Statistics 36(4), 2008, 1595-1618.
http://www-stat.stanford.edu/~hastie/Papers/Preconditioning_Annals.pdf

Regards, Mark.
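As a rough illustration of how PCA can "point the way" to a subset of the original variables, here is a minimal sketch in base R. It is not the algorithm used by the subselect package and not the preconditioning method of Paul et al.; it is one simple loading-based rule of the kind discussed in the PCA literature: for each of the first k components, keep the variable with the largest absolute loading. The function name pca_pick, the toy data and the choice k = 2 are assumptions made up for this example, and the code assumes a numeric matrix with column names and no missing values.

```r
# Hypothetical helper (not from any package): for each of the first k
# principal components, keep the not-yet-chosen variable with the
# largest absolute loading on that component.
pca_pick <- function(X, k) {
  stopifnot(!is.null(colnames(X)), k <= ncol(X))
  pc <- prcomp(X, center = TRUE, scale. = TRUE)
  chosen <- character(0)
  for (j in seq_len(k)) {
    load_j <- abs(pc$rotation[, j])
    load_j[chosen] <- -Inf            # skip variables already selected
    chosen <- c(chosen, names(which.max(load_j)))
  }
  chosen
}

# Toy data (an assumption): two latent factors, five observed variables.
set.seed(1)
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- cbind(
  a1 = z1 + 0.3 * rnorm(n),
  a2 = z1 + 0.3 * rnorm(n),
  a3 = z1 + 0.3 * rnorm(n),
  b1 = z2 + 0.3 * rnorm(n),
  b2 = z2 + 0.3 * rnorm(n)
)

picked <- pca_pick(X, k = 2)
print(picked)
```

The idea is that each retained variable stands in for one dominant direction of variation, so the reduced set stays interpretable in terms of the original columns. How many components to use, and whether this criterion matches the goal of the analysis, is left to the analyst.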
Mark,
Slightly more relevant are the unsupervised sparse principal component
methods described in the following references. If anyone knows of
better references for this please let me know. -Frank
@Article{zou06spa,
author = {Zou, Hui and Hastie, Trevor and Tibshirani, Robert},
title = {Sparse principal component analysis},
journal = J Comp Graph Stat,
year = 2006,
volume = 15,
pages = {265-286},
annote = {gene microarray;lasso/elastic net;multivariate
analysis;data reduction;singular value
decomposition;thresholding;principal components analysis that shrinks
some loadings to zero}
}
@Article{wit08tes,
author = {Witten, Daniela M. and Tibshirani, Robert},
title = {Testing significance of features by lassoed principal
components},
journal = Annals Appl Stat,
year = 2008,
volume = 2,
number = 3,
pages = {986-1012},
annote = {reduction in false discovery rates over using a vector of
t-statistics;borrowing strength across genes;``one would not expect a
single gene to be associated with the outcome, since, in practice, many
genes work together to effect a particular phenotype. LPC effectively
down-weights individual genes that are associated with the outcome but
that do not share an expression pattern with a larger group of genes,
and instead favors large groups of genes that appear to be
differentially-expressed.'';regress principal components on outcome}
}
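To make "principal components analysis that shrinks some loadings to zero" concrete, here is a small base-R sketch. It is NOT the elastic-net formulation of the 2006 sparse PCA paper cited above; a simpler technique is swapped in for illustration, namely a rank-one alternating update with soft-thresholding of the loadings. The names soft_threshold and sparse_pc1, the penalty values and the toy data are all assumptions made up for this example. The elastic-net version is, as far as I know, available in the elasticnet package on CRAN.

```r
# Soft-thresholding: shrink towards zero and set small values to exactly zero.
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

# Hypothetical helper: first sparse loading vector via alternating updates
# (score vector u given loadings v, then thresholded loadings v given u).
sparse_pc1 <- function(X, lambda, iters = 50) {
  X <- scale(X)                              # centre and scale the columns
  v <- svd(X, nu = 0, nv = 1)$v[, 1]         # start from the ordinary PC1
  for (i in seq_len(iters)) {
    u <- X %*% v
    u <- u / sqrt(sum(u^2))                  # unit-length score vector
    v <- soft_threshold(drop(crossprod(X, u)), lambda)
    if (all(v == 0)) stop("lambda is too large: all loadings shrunk to zero")
    v <- v / sqrt(sum(v^2))                  # renormalise the loadings
  }
  setNames(v, colnames(X))
}

# Toy data (an assumption): three correlated "signal" columns, three noise columns.
set.seed(42)
n <- 100
z <- rnorm(n)
X <- cbind(
  s1 = z + 0.3 * rnorm(n),
  s2 = z + 0.3 * rnorm(n),
  s3 = z + 0.3 * rnorm(n),
  n1 = rnorm(n),
  n2 = rnorm(n),
  n3 = rnorm(n)
)

dense  <- sparse_pc1(X, lambda = 0)   # no penalty: ordinary first component
sparse <- sparse_pc1(X, lambda = 5)   # penalty: some loadings become exactly zero
print(round(cbind(dense, sparse), 3))
```

Comparing the two columns shows which loadings the penalty removed. Variables whose loadings are exactly zero drop out of the component, which is what makes sparse components usable for variable reduction without an outcome variable.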
Ravi Varadhan wrote:
Principal components analysis does "dimensionality reduction" but NOT "variable reduction". However, Jolliffe's 2004 book on PCA does discuss the problem of selecting a subset of variables, with the goal of representing the internal variation of the original multivariate vector as well as possible (see Section 6.3 of that book). I do not think that these methods can handle missing data.

The most important issue is to think about the goal of variable reduction and then choose an appropriate optimality criterion for achieving that goal. In most instances of variable selection, the criterion that is optimized is never explicitly considered.

Ravi.

-----------------------------------------------------------------------------------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
-----------------------------------------------------------------------------------

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
Sent: Tuesday, December 09, 2008 8:00 AM
To: Harsh
Cc: r-help at r-project.org
Subject: Re: [R] Pre-model Variable Reduction

See:
?prcomp
?princomp

On Tue, Dec 9, 2008 at 5:34 AM, Harsh <singhalblr at gmail.com> wrote:
Hello All,

I am trying to carry out variable reduction. I do not have information about the dependent variable, and have only the X variables as it were. In selecting variables I wish to keep, I have considered the following criteria.

1) Percentage of missing values in each column/variable
2) Variance of each variable, with a cut-off value.

I recently came across Weka and found that there is an RWeka package which would allow me to make use of Weka through R. Weka provides a "Genetic search" variable reduction method, but I could not find its R code implementation in the RWeka PDF file on CRAN.

I looked for other R packages that allow me to do variable reduction without considering a dependent variable. I came across the 'dprep' package but it does not have a Windows implementation.

Moreover, I have a dataset that contains continuous and categorical variables, some categorical variables having 3 levels, 10 levels and so on, up to a maximum of 50 levels (e.g. States in the USA).

Any suggestions in this regard will be much appreciated.

Thank you
Harsh Singhal
Decision Systems, Mu Sigma, Inc.
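The two screening criteria listed above (percentage missing and a variance cut-off) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name screen_vars, the cut-off values 0.2 and 1e-8, and the toy data frame are made up for the example and are not recommendations. Categorical columns are screened on missingness only, since a variance is not defined for them.

```r
# Hypothetical helper: report, per column, the fraction missing, the
# variance (numeric columns only) and whether the column passes both screens.
# max_missing and min_var are illustrative cut-offs, not recommendations.
screen_vars <- function(df, max_missing = 0.2, min_var = 1e-8) {
  frac_missing <- unname(colMeans(is.na(df)))
  is_num <- unname(vapply(df, is.numeric, logical(1)))
  v <- rep(NA_real_, ncol(df))
  v[is_num] <- vapply(df[is_num], var, numeric(1), na.rm = TRUE)
  # Numeric columns must reach the variance cut-off; others skip that screen.
  pass_var <- !is_num | (!is.na(v) & v >= min_var)
  data.frame(
    variable     = names(df),
    frac_missing = frac_missing,
    variance     = v,
    keep         = frac_missing <= max_missing & pass_var
  )
}

# Toy data (an assumption) with one column failing each screen.
toy <- data.frame(
  a     = c(1, 2, 3, 4, 5),                         # passes both screens
  b     = c(NA, NA, NA, 1, 2),                      # 60% missing
  c     = rep(7, 5),                                # zero variance
  state = factor(c("NY", "CA", "NY", "TX", "CA"))   # categorical
)

res  <- screen_vars(toy)
print(res)
kept <- toy[, res$keep, drop = FALSE]
print(names(kept))
```

Note that raw variances depend on the units of each variable, so a single cut-off is only meaningful if the columns are on comparable scales or the cut-off is applied to something scale-free.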
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University