Memory Needed for Regression

4 messages · efreeman, Doran, Harold, David Winsemius +1 more

#
The size of the model matrix X can be estimated approximately. It depends on the kind of data in the model matrix. For instance, floating-point values require more memory than integers (in R, 8 bytes per cell for doubles versus 4 bytes per cell for integers). If your model matrix is sparse, you can use some hidden functions in the Matrix package for sparse model matrices and save a lot of memory in doing so, though I am not certain how to estimate memory requirements under such conditions.
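As a rough sketch of the estimate described above, the following compares rows * columns * bytes-per-cell with what object.size() reports. The dimensions and the data frame `d` are made up for illustration. The sparse part assumes that sparse.model.matrix() from the Matrix package is the kind of function meant above; that is a guess, not something stated in the thread.

```r
# Hypothetical dimensions, chosen only for illustration.
n <- 1000   # rows (observations)
p <- 20     # columns of the model matrix

# Dense double matrix: about 8 bytes per cell plus a small header.
estimated_bytes <- n * p * 8
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
actual_bytes <- as.numeric(object.size(X))

# The same shape stored as integers: about 4 bytes per cell.
X_int <- matrix(sample.int(10L, n * p, replace = TRUE), nrow = n, ncol = p)
int_bytes <- as.numeric(object.size(X_int))

# Sparse case: measure rather than estimate. A dummy-coded factor
# produces a model matrix that is mostly zeros.
if (requireNamespace("Matrix", quietly = TRUE)) {
  d <- data.frame(f = factor(sample(letters, n, replace = TRUE)))
  dense_mm_bytes  <- as.numeric(object.size(model.matrix(~ f, data = d)))
  sparse_mm_bytes <- as.numeric(
    object.size(Matrix::sparse.model.matrix(~ f, data = d))
  )
}
```

For a sparse matrix the cost scales with the number of nonzero cells rather than with rows * columns, so measuring a small sample with object.size() and scaling up is probably the simplest way to get an estimate.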
#
On Jan 10, 2011, at 5:28 PM, efreeman wrote:

            
Figure 10-12 bytes times X * Y as the size of the matrix or data frame,
and you will need 4-5 times that amount to do useful work.

You can check my guesstimate on one of my objects:

 > object.size(set1HLI)
5907427736 bytes
 > nrow(set1HLI)
[1] 5325006
 > length(set1HLI)
[1] 166

 > 5907427736/5325006
[1] 1109.375
 > 1109.375/166
[1] 6.682982

So I might have been a bit on the high side with my estimate for  
number of bytes per cell. I have a bunch of constructed factor  
variables that only take 4 bytes per "cell". The byte-to-cell ratio is  
8 for "numeric" variables and 4 for "factor" or "integer" variables,  
plus variable amounts for character variables and "overhead". With my  
other computer activities I end up needing about 24 GB which can hold  
probably 10 regression models ... needing space for vectors of  
predicted values and residuals that are as long as the input, and they  
typically run around 300-500MB.
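The per-cell figures above can be checked on small vectors, and the arithmetic from the transcript can be done in one step. This is only a sketch: `bytes_per_cell` and `rule_of_thumb_gb` are hypothetical helper names, and the defaults of 10 bytes per cell and a working factor of 4 are simply the low ends of the ranges quoted above.

```r
# Bytes per cell for the object shown above (numbers taken from the transcript).
observed_ratio <- 5907427736 / (5325006 * 166)

# Measure bytes per cell for a few column types on simulated data.
n <- 100000
bytes_per_cell <- function(x) as.numeric(object.size(x)) / length(x)
per_cell <- sapply(
  list(
    numeric = rnorm(n),
    integer = sample.int(100L, n, replace = TRUE),
    factor  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
  ),
  bytes_per_cell
)

# Rule of thumb: data size times a working factor, expressed in GB.
rule_of_thumb_gb <- function(rows, cols, bytes_per_cell = 10, work_factor = 4) {
  rows * cols * bytes_per_cell * work_factor / 1024^3
}
needed_gb <- rule_of_thumb_gb(5325006, 166)
```

Character columns are left out on purpose, since their size depends on how many distinct strings they hold.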
David Winsemius, MD
West Hartford, CT
#
On Mon, 10 Jan 2011, efreeman wrote:

            
 	install.packages("biglm")
 	require(biglm)

Then see

 	?biglm

"biglm creates a linear model object that uses only p^2 memory for p 
variables. It can be updated with more data using update. This allows 
linear regression on data sets larger than memory."
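A minimal sketch of the chunked workflow that help page describes, assuming biglm is installed. The data are simulated and split into ten in-memory chunks purely for illustration; with real data each chunk would be read from disk or a database instead.

```r
# Runs only if biglm is available; install.packages("biglm") otherwise.
if (requireNamespace("biglm", quietly = TRUE)) {
  set.seed(1)
  n <- 10000
  dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

  # Stand-in for reading the data piece by piece.
  chunks <- split(dat, rep(1:10, each = n / 10))

  # Fit on the first chunk, then fold in the remaining chunks one at a time.
  fit <- biglm::biglm(y ~ x1 + x2, data = chunks[[1]])
  for (chunk in chunks[-1]) {
    fit <- update(fit, chunk)
  }

  chunked_coef <- coef(fit)
  # Reference fit on the full data, possible here only because it is small.
  full_coef <- coef(lm(y ~ x1 + x2, data = dat))
}
```

If the model includes factors, every chunk needs to carry the same factor levels, otherwise the chunks produce model matrices with different columns.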


If you want to get serious about this look in Golub and Van Loan* (Sorry, 
my copy is not at hand so I cannot be more specific. Maybe there is a 
section like "Updating Matrix Factorizations" that says what is needed.)

Also, see

Algorithm AS274 Applied Statistics (1992) Vol.41, No. 2

which is what biglm() refers to. And maybe read the source code of 
biglm() if you are planning on using that package.
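To see why memory of order p^2 can be enough, here is a base R sketch that accumulates the cross products X'X and X'y chunk by chunk and solves the normal equations at the end. This is not AS274 and not what biglm() does internally (that algorithm updates a QR-type factorization, which is numerically safer); it is a simpler stand-in that shows the same idea of never holding all rows at once. The data and variable names are made up.

```r
set.seed(42)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 0.5 + 1.5 * dat$x1 - 2 * dat$x2 + rnorm(n)

p <- 3                      # intercept + x1 + x2
XtX <- matrix(0, p, p)      # p x p accumulator
Xty <- numeric(p)           # length-p accumulator

chunk_id <- rep(1:10, each = n / 10)
for (k in 1:10) {
  chunk <- dat[chunk_id == k, ]          # stand-in for reading one chunk
  X <- model.matrix(~ x1 + x2, data = chunk)
  XtX <- XtX + crossprod(X)              # add this chunk's X'X
  Xty <- Xty + drop(crossprod(X, chunk$y))
}

# Solve (X'X) beta = X'y using only the accumulated summaries.
beta_chunked <- drop(solve(XtX, Xty))
beta_full <- coef(lm(y ~ x1 + x2, data = dat))
```

Only the p x p matrix and the length-p vector persist between chunks, so the number of rows does not affect memory use beyond the size of one chunk.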

HTH,

Chuck

* @book{golub1996matrix,
   title={{Matrix computations}},
   author={Golub, G.H. and Van Loan, C.F.},
   isbn={0801854148},
   year={1996},
   publisher={Johns Hopkins Univ Pr}
}
Charles C. Berry                            Dept of Family/Preventive Medicine
cberry at tajo.ucsd.edu			    UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901