
Does R have a function/package that works similarly to SAS's 'PROC REG'?

4 messages · CZ, Joshua Wiley, Thomas Stewart

CZ
#
Hello,
 
I am working on a variable selection problem, and I wonder whether there is
some function or package in R that works similarly to 'PROC REG' in SAS.
Thank you.

Some facts about 'PROC REG':
PROC REG in SAS first composes a crossproducts matrix. The matrix can be
calculated from input data, reformed from an input correlation matrix, or
read in from an SSCP data set. For each model, the procedure selects the
appropriate crossproducts from the main matrix. The normal equations formed
from the crossproducts are solved by using a sweep algorithm (Goodnight
1979). The method is accurate for data that are reasonably scaled and not
too collinear. 
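As a rough illustration of the description above, here is a textbook form of the sweep operator applied to an augmented crossproducts (SSCP) matrix in base R. This is only a sketch on simulated data, not SAS's actual implementation; the function name sweep_op and all variable names are made up for the example.

```r
# Textbook sweep operator on a symmetric matrix A, pivoting on index k.
# (Illustrative only; sweep_op is a hypothetical helper, not SAS code.)
sweep_op <- function(A, k) {
  d <- A[k, k]
  B <- A - outer(A[, k], A[k, ]) / d   # b_ij = a_ij - a_ik * a_kj / a_kk
  B[k, ] <- A[k, ] / d                 # pivot row
  B[, k] <- A[, k] / d                 # pivot column
  B[k, k] <- -1 / d                    # pivot element
  B
}

# Simulated data: intercept plus 3 predictors (assumed, for illustration).
set.seed(1)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)

# Augmented SSCP matrix [X'X, X'y; y'X, y'y].
S <- crossprod(cbind(X, y))

# Sweeping on the 4 predictor columns solves the normal equations in place:
# the last column then holds the coefficients, the corner holds the RSS.
for (k in 1:4) S <- sweep_op(S, k)
beta_sweep <- unname(S[1:4, 5])
rss_sweep  <- unname(S[5, 5])

# Compare with R's QR-based least squares.
fit <- lm.fit(X, y)
all.equal(beta_sweep, unname(fit$coefficients))
all.equal(rss_sweep, sum(fit$residuals^2))
```

Because each candidate model only needs a few sweeps on the same crossproducts matrix, this is what makes searching over many subsets cheap.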

The sweep algorithm is also used in many places in the model-selection
methods, and the RSQUARE method uses the leaps-and-bounds algorithm of
Furnival and Wilson (1974).

Thanks. 
CZ
#
Hi CZ,

The methods may not be the same, but you can use lm() for basic linear
regression, and glm() for generalized linear models (including logistic
regression).  Do you have a particular goal or statistical analysis in mind?
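For instance, a minimal sketch of both calls on simulated data (the data frame d and the column names g1, g2, and y are invented for illustration):

```r
# Simulated stand-in data: two continuous predictors and a binary response.
set.seed(42)
d <- data.frame(g1 = rnorm(60), g2 = rnorm(60))
d$y <- rbinom(60, 1, plogis(0.8 * d$g1 - 0.5 * d$g2))

# Ordinary least squares (roughly what PROC REG fits).
fit_lin <- lm(y ~ g1 + g2, data = d)

# Logistic regression via a generalized linear model.
fit_log <- glm(y ~ g1 + g2, family = binomial, data = d)

summary(fit_lin)$r.squared   # R-squared of the linear fit
deviance(fit_log)            # residual deviance of the logistic fit
```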

Cheers,

Josh
On Wed, Oct 6, 2010 at 12:50 PM, CZ <cxzhang at ualr.edu> wrote:

  
    
CZ
#
Hi, Josh, 

We have a microarray data set with 2000 genes and roughly 60 samples, split
2:1 cancer:normal, so we essentially have one binary response and 2000
continuous predictors.  We want to use this to develop an ensemble-based
classifier in which the members of the ensemble are all gene pairs.  To this
end, we want to use the leaps-and-bounds algorithm to obtain the K = 200,
500, or 1000 best-performing subsets of size 2 (gene pairs) to feed into our
ensemble.  We had partial success doing this in SAS, as follows:

1.	The SAS LOGISTIC procedure (the natural choice for our binary outcome,
because it does logistic regression) would include only the first 60 genes
in the leaps-and-bounds search, and would print for each of the remaining
genes a message saying it was a linear combination of the first 60 genes and
was therefore being excluded.

2.	However, the SAS REG procedure (not the natural choice for our binary
outcome, because it does linear regression) would include all 2000 genes
in the leaps-and-bounds search, and was not bothered by the linear
dependencies.  It gave results that held up quite well in subsequent
analyses.
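One plausible reading of the exclusion messages in item 1 is plain rank deficiency: with about 60 samples, a design matrix holding all genes at once cannot have rank above 60, whereas any single gene pair is unproblematic. A small sketch on simulated data (200 columns stand in for the 2000 genes; all names are made up):

```r
# Simulated expression matrix: n samples by p genes, with p > n.
set.seed(7)
n <- 60
p <- 200
X <- matrix(rnorm(n * p), n, p)

# Rank of the full design (intercept plus all genes) cannot exceed n,
# so most columns are linear combinations of the others.
rank_full <- qr(cbind(1, X))$rank
rank_full

# A model with just one gene pair (plus intercept) has full column rank.
rank_pair <- qr(cbind(1, X[, c(1, 2)]))$rank
rank_pair
```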

So, first we want to replicate in R what we did in SAS with the linear
regression, i.e., use the Leaps and Bounds algorithm to obtain the K=200,
500, or 1000 best-performing linear-regression models of Size=2 Genes from
our list of 2000 genes, and not have it exclude genes for being a linear
combination of the basis set.  Then we want to use R to try and do what SAS
could not: get logistic regression to do the same thing and not have it
exclude genes for being a linear combination of the basis set.  
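To make the goal concrete, here is a rough base-R sketch on simulated data. It is not the leaps-and-bounds algorithm: for subsets of size 2, every pair can simply be enumerated, and the R-squared of each two-gene linear model follows from correlations alone. The top pairs are then refit one at a time with logistic regression, so no gene is dropped for depending on the others. The sizes (100 genes, K = 20) and all object names are assumptions chosen to keep the example small.

```r
# Simulated data: 60 samples (40 cancer, 20 normal), 100 stand-in genes.
set.seed(123)
n <- 60
p <- 100
K <- 20
y <- rep(c(1, 0), c(40, 20))
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))
X[y == 1, 1:2] <- X[y == 1, 1:2] + 1   # give two genes a real signal

# Correlations of each gene with y, and between genes.
r_xy <- drop(cor(X, y))
R    <- cor(X)

# All gene pairs (a < b) and their between-gene correlation.
pairs <- t(combn(p, 2))
a <- pairs[, 1]
b <- pairs[, 2]
r_ab <- R[pairs]

# R-squared of the linear model y ~ gene_a + gene_b, from correlations only.
r2 <- (r_xy[a]^2 + r_xy[b]^2 - 2 * r_xy[a] * r_xy[b] * r_ab) / (1 - r_ab^2)

# Keep the K best pairs.
top  <- order(r2, decreasing = TRUE)[1:K]
best <- data.frame(gene_a = colnames(X)[a[top]],
                   gene_b = colnames(X)[b[top]],
                   r2     = unname(r2[top]))

# Sanity check of the closed form against lm() for the best pair.
check <- lm(y ~ X[, a[top[1]]] + X[, b[top[1]]])
all.equal(best$r2[1], summary(check)$r.squared)

# Refit each retained pair with logistic regression, one pair at a time.
best$deviance <- vapply(top, function(idx) {
  deviance(glm(y ~ X[, a[idx]] + X[, b[idx]], family = binomial))
}, numeric(1))
head(best)
```

With 2000 genes there are 1,999,000 pairs, which is still small enough to enumerate this way; ranking all of them directly by logistic deviance would need one glm() fit per pair and would be much slower.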

Thanks.