
Memory problem on a linux cluster using a large data set

3 messages · Iris Kolder, Martin Morgan, Andy Liaw

Iris --

I hope the following helps; I think you have too much data for a
32-bit machine.

Martin

Iris Kolder <iriskolder at yahoo.com> writes:
It seems like a single copy of this data set will be at least a couple
of gigabytes; I think you'll have access to only 4 GB on a 32-bit
machine (see section 8 of the R Installation and Administration guide),
and R will probably end up, even in the best of situations, making at
least a couple of copies of your data. Probably you'll need a 64-bit
machine, or figure out algorithms that work on chunks of data.
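The back-of-the-envelope estimate can be checked in R itself; the dimensions below are hypothetical stand-ins for the real data set:

```r
## Rough memory footprint of one numeric copy of the data.
## Dimensions are hypothetical; substitute your own.
n.rows <- 3000                 # samples
n.cols <- 300000               # SNP columns
bytes.per.double <- 8          # R stores numerics as 8-byte doubles
gb <- n.rows * n.cols * bytes.per.double / 2^30
gb                             # about 6.7 GB for a single copy
```

With R making a couple of copies during an analysis, even one copy of this size already exceeds the roughly 4 GB a 32-bit process can address.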
This is quite old, and in general R has since become more attentive to
big-data issues and to tracking down unnecessary memory copying.
Use traceback() or options(error=recover) to figure out where the
error is actually occurring.
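A minimal sketch of those two debugging approaches; `my_big_analysis` is a hypothetical stand-in for whatever step fails:

```r
## After an error, print the call stack to see where it happened:
##   traceback()

## Or drop into an interactive browser at the point of failure:
options(error = recover)

## ... re-run the failing step, e.g. result <- my_big_analysis(SNP) ...

## Restore the default error handler afterwards:
options(error = NULL)
```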
This makes a data.frame, and data frames have several aspects (e.g.,
automatic creation of row names on sub-setting) that can be problematic
in terms of memory use. Probably better to use a matrix, for which:

     'read.table' is not the right tool for reading large matrices,
     especially those with many columns: it is designed to read _data
     frames_ which may have columns of very different classes. Use
     'scan' instead.
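A sketch of the scan() approach; the file written here is a tiny stand-in for the real genotype file:

```r
## Write a small demonstration file; with real data you would
## point scan() at the existing file on disk.
f <- tempfile()
write(t(matrix(1:12, nrow = 3, byrow = TRUE)), f, ncolumns = 4)

## scan() returns one flat numeric vector; shape it into a
## matrix, which is much leaner than a data.frame.
ncols <- 4
SNP <- matrix(scan(f, what = double(), quiet = TRUE),
              ncol = ncols, byrow = TRUE)
dim(SNP)    # 3 x 4
```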

(from the help page for read.table). I'm not sure of the details of
the algorithms you'll invoke, but it might be a false economy to try
to get scan to read in 'small' versions (e.g., integer, rather than
numeric) of the data -- the algorithms might insist on numeric data,
and then make a copy during coercion from your small version to
numeric.
This adds a column to the data.frame or matrix, probably causing at
least one copy of the entire data. Create a separate vector instead,
even though this unties the coordination between columns that a data
frame provides.
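A sketch of the separate-vector approach; `SNP` and `pheno` are hypothetical names:

```r
## Keep the phenotype in its own vector rather than binding it
## on as a column (which would copy the whole matrix).
SNP   <- matrix(0, nrow = 5, ncol = 3)   # genotype matrix
pheno <- rnorm(nrow(SNP))                # response, stored separately

## The cost of losing data.frame coordination: any row
## subsetting must be applied to both objects in parallel.
keep      <- c(1, 3, 5)
SNP.sub   <- SNP[keep, , drop = FALSE]
pheno.sub <- pheno[keep]
```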
This will also probably trigger a copy.
R might be clever enough to figure out that this simple assignment
does not trigger a copy. But it probably means that any subsequent
modification of snp.na or SNP *will* trigger a copy, so avoid the
assignment if possible.
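The copy-on-modify behavior described here can be seen in a small sketch (in a build of R with memory profiling enabled, tracemem(snp.na) would report the exact moment of duplication):

```r
SNP <- matrix(0, nrow = 3, ncol = 3)
snp.na <- SNP        # cheap: both names refer to the same data
snp.na[1, 1] <- NA   # first modification forces a full copy

## SNP is unchanged, but two complete copies now exist in memory.
SNP[1, 1]            # 0
snp.na[1, 1]         # NA
```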
Now you're entirely in the hands of the randomForest. If memory
problems occur here, perhaps you'll have gained enough experience to
point the package maintainer to the problem and suggest a possible
solution.
If you mean a pure Fortran solution, including coding the random
forest algorithm, then of course you have complete control over memory
management. You'd still likely be limited to addressing 4 GB of
memory.

  
    
In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:

- Use larger nodesize
- Use sampsize to control the size of bootstrap samples

Both of these have the effect of reducing the size of the trees grown.
For a data set that large, growing smaller trees probably won't hurt.
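A sketch of Andy's two suggestions (assumes the randomForest package is installed; the data here are simulated stand-ins for the real SNP matrix):

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(1)
  x <- matrix(rnorm(200 * 10), ncol = 10)
  y <- factor(sample(c("case", "control"), 200, replace = TRUE))
  fit <- randomForest(x, y,
                      ntree    = 100,
                      nodesize = 25,  # larger terminal nodes => shallower trees
                      sampsize = 50)  # smaller bootstrap sample per tree
}
```

Smaller trees mean fewer nodes stored per tree, which is where much of the forest's memory goes.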

Still, with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan