
variable/model selection (step/stepAIC) for biglm?

9 messages · Charles C. Berry, Tal Galili, Thomas Lumley

#
Hello dear R mailing list members.

I have recently become curious about the possibility of applying model
selection algorithms (even ones as simple as AIC-based stepwise
selection) to regressions on large datasets. I searched as best as I
could, but couldn't find any reference or wrapper for using step or
stepAIC with packages such as biglm.

Any ideas or directions on how to implement such a concept?


Best,
Tal
#
On Sat, 21 Feb 2009, Tal Galili wrote:

Large in the sense of many observations, one assumes.

But how large in terms of the number of variables??

If not too many variables, then you can form the regression sums of 
squares for all 2^p combinations of regressors from a biglm() fit of all 
variables, since biglm provides coef() and vcov() methods.
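
A minimal sketch of that idea, comparing nested submodels from a single biglm() fit via Wald statistics built only from coef() and vcov() (the data frame `dat` and the variable names `y`, `x1`..`x3` are hypothetical):

```r
library(biglm)
fit <- biglm(y ~ x1 + x2 + x3, data = dat)
b <- coef(fit)   # coefficient estimates
V <- vcov(fit)   # estimated covariance matrix

## Wald statistic for dropping a subset of regressors, here x2 and x3:
drop <- c("x2", "x3")
W <- t(b[drop]) %*% solve(V[drop, drop]) %*% b[drop]
W   # compare against a chi-squared quantile on length(drop) df
```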

If it is large, then you will most likely need to do subsampling via 
lm() and friends to reduce the number of variables to 'not too many', 
and then apply the above strategy.
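
The subsampling route might look like the following sketch (again a hypothetical data frame `dat`; the subsample size of 10000 rows is an arbitrary choice):

```r
## Screen variables with step() on a random subsample that fits in memory:
idx <- sample(nrow(dat), 10000)
screened <- step(lm(y ~ ., data = dat[idx, ]), trace = 0)
formula(screened)   # reduced formula to refit with biglm() on the full data
```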

> I searched as best as I could, but couldn't find any
Surely any direct application of step() would take hopelessly long to 
execute.


HTH,

Chuck
Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
#
Hi Chuck,

Thanks for the guidelines.

I was hoping someone in the group had already handled this type of
task and had some handy code to share.
I'll wait another day or two to see if anyone responds with more
ideas or experience, and if nothing comes up, I might try my hand
at your suggestion.

Cheers,
Tal
On Sat, Feb 21, 2009 at 8:09 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:

#
On Sat, 21 Feb 2009, Charles C. Berry wrote:

If you can fit the complete p-variable model (so you have more observations than variables) the search algorithms then don't require the raw data so the search time depends on p but not on n.  That's how the leaps package works, for example.  This is only for lm(), but you get a pretty good approximation for glm() by doing the search using the weighted linear model from the last iteration of IWLS, finding a reasonably large collection of best models, and then refitting them in glm() to see which is really best.
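
Thomas's glm() approximation could be sketched roughly as follows (hypothetical data frame `dat` with a binary response `y`; the working weights and working response from the converged IWLS fit are fed to leaps as a weighted linear model):

```r
library(leaps)
full <- glm(y ~ ., family = binomial, data = dat)
w <- full$weights                                # IWLS working weights
z <- full$linear.predictors +
     residuals(full, type = "working")           # IWLS working response
X <- model.matrix(full)[, -1]                    # drop intercept column
cand <- regsubsets(x = X, y = z, weights = w, nbest = 5, nvmax = ncol(X))
## then refit the best candidate subsets with glm() and compare their AICs
```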

Of course, none of this solves the problem that AIC isn't correctly calibrated for searching large model spaces.


       -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
On Sun, 22 Feb 2009, Tal Galili wrote:

If you look at the source code in the leaps package, it first sets up a QR decomposition and then calls search routines.  The QR decomposition code is exactly the same Fortran code as is used by biglm, so you should be able to plug the output of biglm into the subset selection code.
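
For anyone exploring that route, the fitted biglm object carries its incremental QR summary as a component; a quick way to inspect it (the internal layout is undocumented and may change, and `dat`, `y`, `x1`..`x3` are hypothetical):

```r
library(biglm)
fit <- biglm(y ~ x1 + x2 + x3, data = dat)
str(fit$qr)   # the incremental QR summary biglm accumulates
```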

       -thomas

#
On Sun, 22 Feb 2009, Tal Galili wrote:

regsubsets() in the 'leaps' package has forward, backward, sequential, and exhaustive search.
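
The four search strategies map onto the `method` argument of regsubsets(); a brief illustration (hypothetical data frame `dat` with response `y`):

```r
library(leaps)
for (m in c("exhaustive", "forward", "backward", "seqrep")) {
  fit <- regsubsets(y ~ ., data = dat, method = m)
  print(summary(fit)$which)   # variables in the best model of each size
}
```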

          -thomas

