
clara - memory limit

2 messages · Huntsinger, Reid, Martin Maechler

I thought setting keep.data=FALSE might help, but running this on a 32-bit
Linux machine, the R process seems to use 1.2 GB until just before clara
returns, when it increases to 1.9 GB, regardless of whether keep.data=FALSE
or TRUE. Possibly it's the overhead of the .C() interface, but that's mostly
an uninformed guess. 
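For reference, a minimal illustration of the option being discussed (simulated data stands in for the real file; `keep.data = FALSE` drops the copy of the input data from the returned object, though, as noted, peak memory during the call itself may still be high):

```r
library(cluster)  # clara() lives in the recommended 'cluster' package

set.seed(42)
x <- matrix(rnorm(5000 * 15), ncol = 15)  # stand-in for a large dataset

## keep.data = FALSE omits the data copy from the result object
cc <- clara(x, k = 7, keep.data = FALSE)
is.null(cc$data)      # TRUE: the data component is only stored with keep.data = TRUE
nrow(cc$medoids)      # 7: medoids are still returned (medoids.x defaults to TRUE)
```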

You could sample your data (say half), remove the original, run clara, keep
the medoids, then read your data again and assign each observation to the
nearest medoid. This is what clara does anyway, with much smaller samples by
default.
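A sketch of that workaround in R (simulated data stands in for the large file; the squared-Euclidean assignment step is one simple way to match rows to medoids):

```r
library(cluster)  # clara() lives in the recommended 'cluster' package

set.seed(1)
full <- matrix(rnorm(10000 * 5), ncol = 5)  # stand-in for the large dataset

## 1. Run clara on a random half of the rows, keep only the medoids
half <- full[sample(nrow(full), nrow(full) %/% 2), ]
fit  <- clara(half, k = 7, keep.data = FALSE)
meds <- fit$medoids
rm(half)  # free the sample before touching the full data again

## 2. Assign every original row to its nearest medoid
d2 <- function(x, m) colSums((t(m) - x)^2)  # squared distances to each medoid
assignment <- apply(full, 1, function(x) which.min(d2(x, meds)))
table(assignment)
```

In a real run you would read the file twice instead of keeping `full` in memory throughout, which is the point of the suggestion.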

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Nestor Fernandez
Sent: Wednesday, August 03, 2005 12:45 PM
To: r-help at stat.math.ethz.ch
Subject: [R] clara - memory limit


Dear all,

I'm trying to estimate clusters from a very large dataset using clara but
the program stops with a memory error. The (very simple) code and the error:

library(foreign)  # read.dbf() is in the foreign package
library(cluster)  # clara() is in the cluster package
mydata <- read.dbf(file = "fnorsel_4px.dbf")
my.clara.7k <- clara(mydata, k = 7)
The dataset contains >3,000,000 rows and 15 columns. I'm using a Windows
computer with 1.5G RAM; I also tried changing the memory limit to the
maximum possible (4000M).
Is there a way to calculate clara clusters from such large datasets?

Thanks a lot.

Nestor.-

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
1 day later
    ReidH> I thought setting keep.data=FALSE might help, but
    ReidH> running this on a 32-bit Linux machine, the R process
    ReidH> seems to use 1.2 GB until just before clara returns,
    ReidH> when it increases to 1.9 GB, regardless of whether
    ReidH> keep.data=FALSE or TRUE. Possibly it's the overhead
    ReidH> of the .C() interface, but that's mostly an
    ReidH> uninformed guess.

Not only that; I've found at least one place to save more memory for
'keep.data = FALSE',
thanks to your careful observation, Reid!

This, together with another small change, will lead to a new
release of the cluster package soon.

Martin Maechler, ETH Zurich