Iris --
I hope the following helps; I think you have too much data
for a 32-bit machine.
Martin
Iris Kolder <iriskolder at yahoo.com> writes:

> Hello,
>
> I have a large data set 320.000 rows and 1000 columns. All the data
> has the values 0, 1, 2.

It seems like a single copy of this data set will be at least
a couple of gigabytes; I think you'll have access to only 4
GB on a 32-bit machine (see section 8 of the R Installation
and Administration guide), and R will probably end up, even
in the best of situations, making at least a couple of copies
of your data. Probably you'll need a 64-bit machine, or to
figure out algorithms that work on chunks of data.
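
A back-of-the-envelope check (a sketch; the exact overhead varies
with R version and object type):

    320000 * 1000 * 8 / 2^30   # numeric (double) storage: ~2.4 GiB
    320000 * 1000 * 4 / 2^30   # integer storage: ~1.2 GiB

and that is before any of the copies made during computation.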

> on a linux cluster with R version R 2.1.0. which operates on a 32
> bit processor.

This is quite old; more recent versions of R have become more
careful about big-data issues and about tracking down unnecessary
memory copying, so it is worth upgrading.
"cannot allocate vector size 1240 kb". I've searched through
Use traceback() or options(error=recover) to figure out where
this is actually occurring.
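
For example, with the two standard debugging hooks:

    options(error = recover)  # on error, browse the frame where it occurred
    ## ... re-run the failing script ...
    traceback()               # afterwards, print the call stack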

> SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file

This makes a data.frame, and data frames have several aspects
(e.g., automatic creation of row names on subsetting) that
can be problematic in terms of memory use. Probably better to
use a matrix, for which:

    'read.table' is not the right tool for reading large matrices,
    especially those with many columns: it is designed to read _data
    frames_ which may have columns of very different classes. Use
    'scan' instead.

(from the help page for read.table). I'm not sure of the
details of the algorithms you'll invoke, but it might be a
false economy to try to get scan to read in 'small' versions
(e.g., integer, rather than
numeric) of the data -- the algorithms might insist on
numeric data, and then make a copy during coercion from your
small version to numeric.
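
A sketch of reading straight into a matrix, assuming the file is
whitespace-separated with the 1000 columns you describe (scan()
returns one long vector, which byrow=TRUE lays out row by row):

    SNP <- matrix(scan("file.txt", what = double()),
                  ncol = 1000, byrow = TRUE)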

> SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs
>                                      # per row; add a column with the total

This adds a column to the data.frame or matrix, probably
causing at least one copy of the entire data. Create a
separate vector instead, even though this unties the
coordination between columns that a data frame provides.
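
For instance, with the 46-NA threshold from your script:

    total.NAs <- rowSums(is.na(SNP))  # a plain vector, not a new column
    SNP <- SNP[total.NAs < 46, ]      # subset directly; nothing to delete later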

> SNP = t(as.matrix(SNP))  # transpose rows and columns

This will also probably trigger a copy of the entire data.

> snp.na <- SNP

R might be clever enough to figure out that this simple
assignment does not trigger a copy. But it probably means that
any subsequent modification of snp.na or SNP *will* trigger a
copy, so avoid the assignment if possible.
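
You can watch this happen with tracemem() (available when R is
built with memory profiling):

    tracemem(SNP)      # report whenever SNP's data is duplicated
    snp.na <- SNP      # no copy yet; both names share the same data
    snp.na[1, 1] <- 0  # the first modification forces the real copy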

> snp.roughfix <- na.roughfix(snp.na)
> fSNP <- factor(snp.roughfix[, 1])  # assign factor to case control status
> snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
>     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
>     keep.forest=FALSE, do.trace=100)

Now you're entirely in the hands of the randomForest package. If
memory problems occur here, perhaps you'll have gained enough
experience to point the package maintainer to the problem and
suggest a possible solution.
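
One possibility (a sketch only, not something I've tried at this
scale) is to grow the forest in smaller batches and merge them
with randomForest's combine(), so fewer trees are built per call:

    library(randomForest)
    rf1 <- randomForest(snp.roughfix[, -1], fSNP, ntree = 250, keep.forest = TRUE)
    rf2 <- randomForest(snp.roughfix[, -1], fSNP, ntree = 250, keep.forest = TRUE)
    rf  <- combine(rf1, rf2)  # one 500-tree forest

Note that combine() needs the forests retained, unlike your
keep.forest=FALSE call, and each batch still sees the full data.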

> I want to run the function Random Forest on my large data set; it
> should be able to cope with that amount. Perhaps someone has tried
> this before in R or is Fortran a better choice? I added my R
> script down below.

If you mean a pure Fortran solution, including coding the
random forest algorithm, then of course you have complete
control over memory management. You'd still likely be limited
to addressing 4 GB of memory.

> I wrote a script to remove all the rows with more than 46 missing
> values. This works perfectly on a smaller dataset, but the problem
> arises when I try to run it on the larger data set: I get an error
> "cannot allocate vector size 1240 kb". I've searched through posts
> and found out that it might be because I'm running it on a linux
> cluster with R version R 2.1.0. which operates on a 32 bit
> processor. But I could not find a solution for this problem. The
> cluster is a really fast one and should be able to cope with these
> large data; the system's configuration is Speed: 3.4 GHz, memory.
> Is there a way to change the settings or processor under R? I want
> to run the function Random Forest on my large data set; it should
> be able to cope with that amount. Perhaps someone has tried this
> before in R or is Fortran a better choice? I added my R script
> down below.
>
> Best regards,
>
> Iris Kolder
>
> library(randomForest)                  # for randomForest() and na.roughfix()
>
> SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> SNP[SNP == 9] <- NA                    # change missing values from a 9 to a NA
> SNP$total.NAs <- rowSums(is.na(SNP))   # number of NAs per row, added as a column
> SNP <- SNP[SNP$total.NAs < 46, ]       # keep rows with no more than 5% (46) NAs
> SNP$total.NAs <- NULL                  # remove the added column of NA counts
> SNP <- t(as.matrix(SNP))               # transpose rows and columns
> snp.na <- SNP
> snp.roughfix <- na.roughfix(snp.na)
> fSNP <- factor(snp.roughfix[, 1])      # assign factor to case control status
> snp.narf <- randomForest(snp.roughfix[, -1], fSNP, na.action=na.roughfix,
>     ntree=500, mtry=10, importance=TRUE, keep.forest=FALSE, do.trace=100)
> print(snp.narf)