Q: R 2.2.1: Memory Management Issues? - R-devel

Thu, Jan 5, 2006 4:33 PM

Dear Simon,

Thank you for taking time to address my questions.

The empirically derived limit on my machine (under R 1.9.1) was approximately 7500 data points.
I have been able to successfully run the script that uses package MCLUST on several hundred smaller data sets.

I had even written a work-around for the case of more than 9600 data points.  My work-around first orders the points by value, then takes a sample (e.g. every other point, or 1 point in every n) to bring the number under 9600.
No problems with the computations were observed, but you are correct that a deconvolution on the larger dataset of 9600 points takes almost 30 minutes.  However, for our purposes, we do not have many datasets over 9600 points, so the time is not a major constraint.
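
For readers who want to try the same idea, the work-around described above might look roughly like the sketch below. This is not the original script; `thin_sorted` and its `max_n` argument are hypothetical names, and the sketch assumes the data are a plain numeric vector.

```r
# Hypothetical helper (not from the original script): sort a numeric
# vector and keep every k-th value so that at most max_n points remain.
thin_sorted <- function(x, max_n = 9600) {
  x <- sort(x)
  n <- length(x)
  if (n <= max_n) return(x)        # already small enough, keep everything
  k <- ceiling(n / max_n)          # smallest step that leaves <= max_n points
  x[seq(1, n, by = k)]
}

set.seed(1)
x <- rnorm(17000)                  # stand-in data, for illustration only
y <- thin_sorted(x)
length(y)                          # 8500: every other point of 17000
```

Sorting before sampling keeps the retained points spread evenly over the ordered values, which is the property the work-around relies on.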

Unfortunately, my management does not like using a work-around and really wants to operate on the larger data sets.
I was told to find a way to make it operate on the larger data sets or avoid using R and find another solution.

Karen
---
Karen M. Green, Ph.D.
Karen.Green at sanofi-aventis.com
Research Investigator
Drug Design Group
Sanofi Aventis Pharmaceuticals
Tucson, AZ  85737

-----Original Message-----
From: Simon Urbanek [mailto:simon.urbanek at r-project.org]
Sent: Thursday, January 05, 2006 5:13 PM
To: Green, Karen M. PH/US
Cc: R-devel at stat.math.ethz.ch
Subject: Re: [Rd] Q: R 2.2.1: Memory Management Issues?
Importance: High


Karen,

On Jan 5, 2006, at 5:18 PM, <Karen.Green at sanofi-aventis.com> wrote:

This is 1.1GB of RAM to allocate for one vector alone(!). As you
stated yourself, the total upper limit is 2GB, so you cannot even fit
two of those in memory - there is not much you can do with it even if
it is allocated.
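
One way to see where a figure of this size can come from is sketched below. It rests on an assumption that is not stated in this message: that the vector in question holds one double (8 bytes) per pair of data points, i.e. n(n-1)/2 values, and that n is the 17000-point data set mentioned later in this message. `pairwise_bytes` is a hypothetical helper name.

```r
# Assumption (not confirmed in the thread): the large vector stores one
# 8-byte double for each of the n*(n-1)/2 pairs of data points.
pairwise_bytes <- function(n, bytes_per_double = 8) {
  n * (n - 1) / 2 * bytes_per_double
}

bytes <- pairwise_bytes(17000)
bytes / 2^30                       # size in GiB, a little under 1.1
```

Under that assumption the size grows with the square of the number of points, so doubling the data set roughly quadruples the allocation.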

I suspect that memory is the least of your problems. Did you even try
to run EMclust on a small subsample? I suspect that if you did, you
would find that what you are trying to do is not likely to terminate
within days...

Because that is not the only 1GB vector that is allocated. Your
"15GB/defragmented" is irrelevant - if anything, look at how much
virtual memory is set up in your system's preferences.

Well, a toy example of 17000x2 needs 2.3GB and is unlikely to
terminate anytime soon, so I'd rather call it shooting with the wrong
gun. Maybe you should consider a different approach to your problem -
possibly ask on the BioConductor list, because people there have more
experience with large data, and this is not really a technical
question about R but rather about how to apply statistical methods.

Any reasonable unix will do - technically (64-bit versions  
preferably, but in your case even 32-bit would do). Again, I don't  
think memory is your only problem here, though.

Cheers,
Simon

Simon Urbanek

Thu, Jan 5, 2006 5:38 PM

On Jan 5, 2006, at 7:33 PM, <Karen.Green at sanofi-aventis.com> wrote:

Well, sure, if your only concern is memory, then moving to unix
will give you several hundred more data points you can use. I would
recommend a 64-bit unix, because then there is practically no
software limit on the size of virtual memory. Nevertheless, there is
still a limit of ca. 4GB for a single vector, so that should give you
around 32500 rows that mclust can handle as-is (I don't want to see
the runtime, though ;)). For anything else you'll really have to
think about another approach.
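
As a rough cross-check of that row count, the sketch below solves for the largest n whose pairwise vector fits in a given byte limit. It assumes, as an illustration only, that the limiting vector holds n(n-1)/2 doubles of 8 bytes each and that "ca. 4GB" means 2^32 bytes; `max_points` is a hypothetical helper name.

```r
# Assumption (for illustration): the limiting vector holds n*(n-1)/2
# doubles, and the per-vector limit is 2^32 bytes.
max_points <- function(limit_bytes, bytes_per_double = 8) {
  m <- limit_bytes / bytes_per_double   # number of doubles that fit
  # largest integer n with n*(n-1)/2 <= m, from the quadratic formula
  floor((1 + sqrt(1 + 8 * m)) / 2)
}

n_max <- max_points(2^32)
n_max                                   # 32768
```

Under these assumptions the result lands in the same range as the figure of around 32500 rows quoted above; the exact number depends on how the limit is counted.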

Cheers,
Simon