
lm Regression takes 24+ GB RAM - Error message

6 messages · R. Michael Weylandt, Jonas125, Milan Bouchet-Valat

#
Hello,

I am a rather inexperienced R user (I learned the language a month ago) and
ran into the following problem on a local computer with 6 cores, 24 GB
RAM, and 64-bit R 2.15. I didn't install any additional packages.

1. Via the read.table command I load a data table (with mixed data
types) that is about 730 MB in size
2. I add 2 calculated columns
3. I split the dataset by 5 criteria
4. I run the lm command on the split with the calculated columns as the
variables
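In code, the steps above look roughly like this (a sketch with made-up column names and random data, not the poster's actual dataset, and with read.table replaced by a constructed data frame so it runs standalone):

```r
# Hypothetical sketch of the workflow; ColumnA..ColumnE are placeholders.
set.seed(1)
Dataset <- data.frame(
  ColumnA = runif(100), ColumnB = runif(100),
  ColumnC = runif(100), ColumnD = runif(100),
  ColumnE = sample(c("g1", "g2"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Step 2: add two calculated columns (ratios of existing columns)
Dataset$ExtraColumn1 <- Dataset$ColumnA / Dataset$ColumnB
Dataset$ExtraColumn2 <- Dataset$ColumnC / Dataset$ColumnD

# Step 3: split into one data frame per group
Datasplit <- split(Dataset, Dataset$ColumnE)

# Step 4: fit one lm() per group
models <- lapply(Datasplit, function(d) lm(ExtraColumn1 ~ ExtraColumn2, data = d))
```

With many groups, `models` ends up holding one full lm object per group, which is where the memory pressure discussed below comes from.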

The RAM consumption goes rapidly up and stays at 24 GB for a couple of
minutes.
The result:
Error: cannot allocate vector of size 5.0 Mb
In addition: There were 50 or more warnings (use warnings() to see the first 50)
--> Reached total allocation of 24559Mb

My code works perfectly fine on a smaller dataset. I am surprised by the
errors, as the CPU should do all the work in the lm calculations and the
output cannot be that large, can it? (I cannot check the size of the
lm object because of the error.)

Right now I am running only 1 linear model, but actually I wanted to run 6!

Is Windows putting restrictions on R's RAM usage? Can I change any
settings?
A RAM upgrade is not an option. Do I need to use a different R package
(bigmemory?) instead?


Thanks in advance for your help!!

--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434.html
Sent from the R help mailing list archive at Nabble.com.
#
On Wed, Mar 6, 2013 at 9:51 AM, Jonas125 <schleeberger.j at pg.com> wrote:
So it seems R has access to all your memory.

My guess is that you have so-called "factors" [Categorical variables]
in your dataset and this makes the linear regression a much larger
calculation (in the intermediate steps) than you might realize because
the design matrix has to deal with all the crossed categories.

Can you provide the output of str(DATA_SET)?

MW
Not a bad idea.
#
The data table (and hence the split) only contains character and numeric
data.

I found that 4 regressions in a row work if I use 2 of the original
columns as variables instead of the calculated columns.
RAM usage stays below 3 GB!
--> Why does R have such problems with the calculated columns? Their
calculation is already finished before the regression starts.

It's like this:
Create the calculated columns:
Dataset$ExtraColumn1 <- Dataset$ColumnA / Dataset$ColumnB
Dataset$ExtraColumn2 <- Dataset$ColumnC / Dataset$ColumnD

Perform the split of the dataset including the calculated columns (the
criteria for the split have a hierarchy):
Datasplit <- split(Dataset, paste(Dataset$ColumnE, Dataset$ColumnE))

Perform the regression on the split data:
Regression1 <- lapply(Datasplit, function(d) lm(ExtraColumn1 ~ ExtraColumn2,
d, na.action = na.omit, singular.ok = TRUE))

BTW: There are no NA values in the data source.

What is my mistake?

When I calculate the columns I might divide by zero (giving Inf). Could
that cause the problem in the regression?

Thanks,
Jonas

--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434p4660496.html
#
On Wednesday, 6 March 2013 at 08:31 -0800, Jonas125 wrote:
What's the value of length(Datasplit)? Have you tried running
regressions manually on Datasplit[[1]] and calling object.size() on the
result to see how large it is?


Regards
#
length(Datasplit) = 7100

I did a regression for Datasplit[[1]] with the calculated columns --> the
object size is 70 MB. Quite large...

Assuming that R cannot handle Inf values in regressions (I didn't have
time to google it):
How can I avoid calculating infinite values? Something like "if the
denominator is zero, use 0.0000001 as the denominator instead."
Dataset[is.infinite(Dataset)] <- 0 does not work for me --> "default method
not implemented for type 'list'"
class(Dataset) = data.frame
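For reference, is.infinite() is only defined for atomic vectors, not for a whole data frame, which is why the error mentions type 'list'. One way around it (a sketch on toy data, not the actual dataset) is to apply the replacement column by column to the numeric columns:

```r
# Toy data frame mixing numeric and character columns;
# dividing by zero produces Inf in the ratio column.
Dataset <- data.frame(x = c(1, 2, 0), y = c(4, 0, 6),
                      label = c("a", "b", "c"), stringsAsFactors = FALSE)
Dataset$ratio <- Dataset$x / Dataset$y   # row 2 is 2/0 = Inf

# Replace Inf in every numeric column; character columns are untouched
num <- vapply(Dataset, is.numeric, logical(1))
Dataset[num] <- lapply(Dataset[num], function(col) {
  col[is.infinite(col)] <- NA   # or 0, per the idea above
  col
})
```

Using NA rather than 0 has the advantage that na.action = na.omit in the lm() call will simply drop those rows instead of fitting against an artificial value.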



--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434p4660501.html
#
On Wednesday, 6 March 2013 at 09:18 -0800, Jonas125 wrote:
7100*70/1024 = 485 (GB)

No wonder you run out of memory so fast.

You probably do not need to store the whole lm objects: usually you only
need the coefficients, the R-squared, things like that. So instead of
returning the full objects, return a vector or a list with only the
elements you need; you will save a lot of space.

And if you really need the objects, set these lm() arguments to FALSE to
make the result smaller:
model, x, y, qr: logicals.  If 'TRUE' the corresponding components of
          the fit (the model frame, the model matrix, the response, the
          QR decomposition) are returned.
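A sketch of both ideas together (toy data; the variable names just mirror the thread). Note that summary() needs the qr component, so that one is left at its default:

```r
# Toy stand-in for one element of Datasplit
set.seed(42)
d <- data.frame(ExtraColumn1 = rnorm(50), ExtraColumn2 = rnorm(50))

# Fit with the bulky components dropped, then keep only what is needed
slim_fit <- function(d) {
  fit <- lm(ExtraColumn1 ~ ExtraColumn2, data = d,
            model = FALSE, x = FALSE, y = FALSE)  # keep qr: summary() uses it
  list(coef = coef(fit), r.squared = summary(fit)$r.squared)
}

# In the thread this would be lapply(Datasplit, slim_fit)
results <- lapply(list(group1 = d), slim_fit)
```

Each element of `results` is then a small list of numbers rather than a ~70 MB lm object, so 7100 groups stay comfortably in memory.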
I don't understand why you think infinite values can trigger a memory
problem. Why don't you just try it?
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'y'
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in 'x'

So, if anything, this would stop your lapply() call sooner or later, and
save your machine from freezing.



Regards