
Memory problem on a linux cluster using a large data set

3 messages · Iris Kolder, Martin Morgan, Andy Liaw

Iris --

I hope the following helps; I think you have too much data for a
32-bit machine.

Martin

Iris Kolder <iriskolder at yahoo.com> writes:
It seems like a single copy of this data set will be at least a couple
of gigabytes; I think you'll have access to only 4 GB on a 32-bit
machine (see section 8 of the R Installation and Administration guide),
and R will probably end up, even in the best of situations, making at
least a couple of copies of your data. Probably you'll need a 64-bit
machine, or figure out algorithms that work on chunks of data.
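The back-of-the-envelope estimate can be checked in R itself; the dimensions below are hypothetical stand-ins for the real data set:

```r
## Rough memory footprint of one numeric copy of the data.
## Dimensions are hypothetical; substitute your own.
n.rows <- 3000                 # samples
n.cols <- 300000               # SNP columns
bytes.per.double <- 8          # R stores numerics as 8-byte doubles
gb <- n.rows * n.cols * bytes.per.double / 2^30
gb                             # about 6.7 GB for a single copy
```

With R making a couple of copies during an analysis, even one copy of this size already exceeds the roughly 4 GB a 32-bit process can address.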
This is quite old, and in general R has since become more attentive to
big-data issues and to tracking down unnecessary memory copying.
Use traceback() or options(error=recover) to figure out where the
error is actually occurring.
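A minimal sketch of those two debugging approaches; `my_big_analysis` is a hypothetical stand-in for whatever step fails:

```r
## After an error, print the call stack to see where it happened:
##   traceback()

## Or drop into an interactive browser at the point of failure:
options(error = recover)

## ... re-run the failing step, e.g. result <- my_big_analysis(SNP) ...

## Restore the default error handler afterwards:
options(error = NULL)
```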
This makes a data.frame, and data frames have several aspects (e.g.,
automatic creation of row names on sub-setting) that can be problematic
in terms of memory use. Probably better to use a matrix, for which:

     'read.table' is not the right tool for reading large matrices,
     especially those with many columns: it is designed to read _data
     frames_ which may have columns of very different classes. Use
     'scan' instead.
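A sketch of the scan() approach; the file written here is a tiny stand-in for the real genotype file:

```r
## Write a small demonstration file; with real data you would
## point scan() at the existing file on disk.
f <- tempfile()
write(t(matrix(1:12, nrow = 3, byrow = TRUE)), f, ncolumns = 4)

## scan() returns one flat numeric vector; shape it into a
## matrix, which is much leaner than a data.frame.
ncols <- 4
SNP <- matrix(scan(f, what = double(), quiet = TRUE),
              ncol = ncols, byrow = TRUE)
dim(SNP)    # 3 x 4
```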

(from the help page for read.table). I'm not sure of the details of
the algorithms you'll invoke, but it might be a false economy to try
to get scan to read in 'small' versions (e.g., integer, rather than
numeric) of the data -- the algorithms might insist on numeric data,
and then make a copy during coercion from your small version to
numeric.
This adds a column to the data.frame or matrix, probably causing at
least one copy of the entire data. Create a separate vector instead,
even though this unties the coordination between columns that a data
frame provides.
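A sketch of the separate-vector approach; `SNP` and `pheno` are hypothetical names:

```r
## Keep the phenotype in its own vector rather than binding it
## on as a column (which would copy the whole matrix).
SNP   <- matrix(0, nrow = 5, ncol = 3)   # genotype matrix
pheno <- rnorm(nrow(SNP))                # response, stored separately

## The cost of losing data.frame coordination: any row
## subsetting must be applied to both objects in parallel.
keep      <- c(1, 3, 5)
SNP.sub   <- SNP[keep, , drop = FALSE]
pheno.sub <- pheno[keep]
```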
This will also probably trigger a copy.
R might be clever enough to figure out that this simple assignment
does not trigger a copy. But it probably means that any subsequent
modification of snp.na or SNP *will* trigger a copy, so avoid the
assignment if possible.
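The copy-on-modify behavior described here can be seen in a small sketch (in a build of R with memory profiling enabled, tracemem(snp.na) would report the exact moment of duplication):

```r
SNP <- matrix(0, nrow = 3, ncol = 3)
snp.na <- SNP        # cheap: both names refer to the same data
snp.na[1, 1] <- NA   # first modification forces a full copy

## SNP is unchanged, but two complete copies now exist in memory.
SNP[1, 1]            # 0
snp.na[1, 1]         # NA
```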
Now you're entirely in the hands of the randomForest. If memory
problems occur here, perhaps you'll have gained enough experience to
point the package maintainer to the problem and suggest a possible
solution.
If you mean a pure Fortran solution, including coding the random
forest algorithm, then of course you have complete control over memory
management. You'd still likely be limited to addressing 4 GB of
memory.

  
    
In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:

- Use larger nodesize
- Use sampsize to control the size of bootstrap samples

Both of these have the effect of reducing the size of the trees grown.
For a data set that large, growing smaller trees probably won't hurt.
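A sketch of Andy's two suggestions (assumes the randomForest package is installed; the data here are simulated stand-ins for the real SNP matrix):

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(1)
  x <- matrix(rnorm(200 * 10), ncol = 10)
  y <- factor(sample(c("case", "control"), 200, replace = TRUE))
  fit <- randomForest(x, y,
                      ntree    = 100,
                      nodesize = 25,  # larger terminal nodes => shallower trees
                      sampsize = 50)  # smaller bootstrap sample per tree
}
```

Smaller trees mean fewer nodes stored per tree, which is where much of the forest's memory goes.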

Still, with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan