lm Regression takes 24+ GB RAM - Error message
6 messages · R. Michael Weylandt, Jonas125, Milan Bouchet-Valat
On Wed, Mar 6, 2013 at 9:51 AM, Jonas125 <schleeberger.j at pg.com> wrote:
Hello, I am a rather inexperienced R user (I learned the language a month ago) and ran into the following problem on a local computer with 6 cores, 24 GB RAM, and R 2.15 64-bit. I didn't install any additional packages.

1. Via the read.table command I load a data table (with different data types) which is about 730 MB large.
2. I add 2 calculated columns.
3. I split the dataset by 5 criteria.
4. I run the lm command on the split with the calculated columns as the variables.

The RAM consumption goes up rapidly and stays at 24 GB for a couple of minutes. The result:

Error: cannot allocate vector of size 5.0 Mb
In addition: There were 50 or more warnings (use warnings() to see the first 50) --> Reached total allocation of 24559Mb
So it seems R has access to all your memory. My guess is that you have so-called "factors" (categorical variables) in your dataset, and this makes the linear regression a much larger calculation (in the intermediate steps) than you might realize, because the design matrix has to deal with all the crossed categories. Can you provide the output of str(DATA_SET)? MW
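[Editorial sketch: a factor inflates the design matrix by one column per level beyond the first, which is the effect MW is describing. The column names here are invented for illustration.]

```r
## A factor with many levels inflates the design matrix that lm() builds:
## one dummy column per level beyond the first, plus the intercept.
d <- data.frame(y = rnorm(1000),
                g = factor(sample(letters, 1000, replace = TRUE),
                           levels = letters))
dim(model.matrix(y ~ g, d))   # 1000 rows x 26 columns (intercept + 25 dummies)
```

With several crossed factors the column count multiplies, and the intermediate matrices can dwarf the original data frame.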
My code works perfectly fine for a smaller dataset. I am surprised by the errors, as the CPU should do all the work in the lm calculations, and the output cannot be that large, can it? (I cannot check the object size of the lm object due to the error.) Right now I am running only 1 linear model, but actually I wanted to run 6!

Is Windows putting restrictions on R regarding RAM usage? Can I change any settings? A RAM upgrade is not an option. Do I need to use a different R package instead (bigmemory?)?
Not a bad idea.
Thanks in advance for your help!!

--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
The data table (and the split, obviously) only contains characters and numeric data. I found that 4 regressions in a row work if I don't use the calculated columns as variables but 2 of the original columns; RAM usage stays below 3 GB! Why does R have such problems with the calculated columns? Their calculation is already done before the regression starts.

It's like this. Create the calculated columns:

Dataset$ExtraColumn1 <- Dataset$ColumnA / Dataset$ColumnB
Dataset$ExtraColumn2 <- Dataset$ColumnC / Dataset$ColumnD

Perform the split of the dataset incl. calculated columns (the criteria for the split have a hierarchy):

Datasplit <- split(Dataset, paste(Dataset$ColumnE, Dataset$ColumnE))

Perform the regression on the split data:

Regression1 <- lapply(Datasplit, function(d) lm(ExtraColumn1 ~ ExtraColumn2, d, na.action = na.omit, singular.ok = TRUE))

BTW: There are no NA values in the data source. What is my mistake? When I calculate the columns I might divide by zero (= Inf). Could that create the problem in the regression?

Thanks, Jonas

--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434p4660496.html
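[Editorial note: a runnable miniature of the pipeline described above, with toy data standing in for the poster's ColumnA..ColumnE; the real code pastes several criteria together for the split, which is simplified here to a single column.]

```r
## Build a small data frame with the same shape as the poster's workflow.
Dataset <- data.frame(ColumnA = rnorm(100), ColumnB = runif(100) + 1,
                      ColumnC = rnorm(100), ColumnD = runif(100) + 1,
                      ColumnE = rep(c("g1", "g2"), 50),
                      stringsAsFactors = FALSE)

## Add the calculated columns (denominators kept > 0 here, so no Inf).
Dataset$ExtraColumn1 <- Dataset$ColumnA / Dataset$ColumnB
Dataset$ExtraColumn2 <- Dataset$ColumnC / Dataset$ColumnD

## Split by the grouping column, then fit one lm per group.
Datasplit <- split(Dataset, Dataset$ColumnE)
Regression1 <- lapply(Datasplit, function(d)
  lm(ExtraColumn1 ~ ExtraColumn2, data = d,
     na.action = na.omit, singular.ok = TRUE))
names(Regression1)   # "g1" "g2"
```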
On Wednesday, 6 March 2013 at 08:31 -0800, Jonas125 wrote:
The data table (and the split, obviously) only contains characters and numeric data. I found that 4 regressions in a row work if I don't use the calculated columns as variables but 2 of the original columns; RAM usage stays below 3 GB! Why does R have such problems with the calculated columns? Their calculation is already done before the regression starts.

It's like this. Create the calculated columns:

Dataset$ExtraColumn1 <- Dataset$ColumnA / Dataset$ColumnB
Dataset$ExtraColumn2 <- Dataset$ColumnC / Dataset$ColumnD

Perform the split of the dataset incl. calculated columns (the criteria for the split have a hierarchy):

Datasplit <- split(Dataset, paste(Dataset$ColumnE, Dataset$ColumnE))

Perform the regression on the split data:

Regression1 <- lapply(Datasplit, function(d) lm(ExtraColumn1 ~ ExtraColumn2, d, na.action = na.omit, singular.ok = TRUE))

BTW: There are no NA values in the data source. What is my mistake?
What's the value of length(Datasplit)? Have you tried running regressions manually on Datasplit[[1]] and calling object.size() on the result to see how large it is? Regards
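[Editorial sketch of the check being suggested, using toy data in place of the real Datasplit; ExtraColumn1/ExtraColumn2 are the poster's column names.]

```r
## Fit one group's worth of data and inspect the size of the result.
## By default an lm object stores the model frame, residuals, fitted
## values, QR decomposition, etc., so it scales with the number of rows.
toy <- data.frame(ExtraColumn1 = rnorm(1e5), ExtraColumn2 = rnorm(1e5))
fit <- lm(ExtraColumn1 ~ ExtraColumn2, data = toy)
print(object.size(fit), units = "Mb")   # noticeably larger than the data itself
```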
When I calculate the columns I might divide by zero (= Inf). Could that create the problem in the regression? Thanks, Jonas
length(Datasplit) = 7100

I did a regression for Datasplit[[1]] and the calculated columns: the object size is 70 MB. Quite large...

Assuming that R cannot handle Inf values in regressions (I didn't have the time to google it), how can I avoid the calculation of infinite values? Something like "if the denominator would be zero, choose 0.0000001 as the denominator instead."

Dataset[is.infinite(Dataset)] <- 0

does not work for me: "default method not implemented for type 'list'" (class(Dataset) = data.frame).

--
View this message in context: http://r.789695.n4.nabble.com/lm-Regression-takes-24-GB-RAM-Error-message-tp4660434p4660501.html
On Wednesday, 6 March 2013 at 09:18 -0800, Jonas125 wrote:
length(Datasplit) = 7100. I did a regression for Datasplit[[1]] and the calculated columns: the object size is 70 MB. Quite large...
7100 * 70 MB / 1024 = 485 GB.

No wonder you run out of memory so fast.
You probably do not need to store the whole lm objects: usually you need coefficients, R-squared, things like that. So instead of returning the objects, return a vector or a list with only the elements you need; you will save much space.
And if you really need the objects, set these lm() arguments to FALSE to
make the result smaller:
model, x, y, qr: logicals. If 'TRUE' the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.
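[Editorial sketch of the effect of those arguments, on toy data.]

```r
## A "slim" fit drops the stored model frame, design matrix, response,
## and QR decomposition. coef() still works; summary() would not,
## because it needs the qr component.
d <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
full <- lm(y ~ x, data = d)
slim <- lm(y ~ x, data = d, model = FALSE, x = FALSE, y = FALSE, qr = FALSE)
as.numeric(object.size(slim)) < as.numeric(object.size(full))   # TRUE
```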
Assuming that R cannot handle Inf values in regressions (I didn't have the time to google it), how can I avoid the calculation of infinite values? Something like "if the denominator would be zero, choose 0.0000001 as the denominator instead." Dataset[is.infinite(Dataset)] <- 0 does not work for me: "default method not implemented for type 'list'" (class(Dataset) = data.frame).
I don't understand why you think infinite values can trigger a memory problem. Why don't you just try it?
lm(c(1, Inf) ~ c(1, 2))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'
lm(c(1, 2) ~ c(1, Inf))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'x'

So, if anything, this would stop your lapply() call sooner or later, and save your machine from freezing. Regards
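[Editorial sketch answering the is.infinite() question above: is.infinite() is only defined for atomic vectors, not whole data frames, so apply it column by column. Replacing with NA, so that na.omit can drop those rows, is one reasonable choice; the column names are invented.]

```r
## Replace Inf/-Inf with NA in the numeric columns of a data frame.
d <- data.frame(a = c(1, Inf, 3), b = c("x", "y", "z"),
                stringsAsFactors = FALSE)
num <- vapply(d, is.numeric, logical(1))          # which columns are numeric
d[num] <- lapply(d[num], function(col) {
  col[is.infinite(col)] <- NA                      # NA rows get dropped by na.omit
  col
})
d$a   # 1 NA 3
```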