
clara - memory limit

4 messages · Nestor Fernandez, Brian Ripley, Martin Maechler

#
Dear all,

I'm trying to estimate clusters from a very large dataset using clara but the
program stops with a memory error. The (very simple) code and the error:

library(foreign)  # read.dbf()
library(cluster)  # clara()
mydata <- read.dbf(file = "fnorsel_4px.dbf")
my.clara.7k <- clara(mydata, k = 7)
Error: cannot allocate vector of size 465108 Kb

The dataset contains >3,000,000 rows and 15 columns. I'm using a Windows
computer with 1.5G RAM; I also tried raising the memory limit to the maximum
possible (4000M).
Is there a way to calculate clara clusters from such large datasets?

Thanks a lot.

Nestor.-
#
'clara' is fully described in chapter 3 of Kaufman and Rousseeuw
      (1990). Compared to other partitioning methods such as 'pam', it
      can deal with much larger datasets.  Internally, this is achieved
      by considering sub-datasets of fixed size ('sampsize') such that
      the time and storage requirements become linear in n rather than
      quadratic.

and the default for 'sampsize' is apparently at least nrow(x).

So you need to set 'sampsize' (and perhaps 'samples') appropriately.
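A hedged sketch of that advice: 'samples' and 'sampsize' are documented arguments of clara() in the cluster package, but the values below are purely illustrative, and the data frame is a small synthetic stand-in for the real 3,000,000-row table.

```r
library(cluster)  # clara()

## Small synthetic stand-in for the real data (15 numeric columns)
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(10000 * 15), ncol = 15))

## Illustrative settings: 50 sub-datasets of 100 rows each, instead of
## letting 'sampsize' grow with the data.  The heavy dissimilarity work
## then scales with 'sampsize', not with nrow(mydata).
my.clara.7k <- clara(mydata, k = 7, samples = 50, sampsize = 100)

length(my.clara.7k$clustering)  # one cluster label per row
```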
On Wed, 3 Aug 2005, Nestor Fernandez wrote:

Actually, the limit is probably 2048M: see the rw-FAQ Q on memory limits.
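For scale, a back-of-the-envelope calculation (mine, not from the thread): one numeric copy of the full table costs rows x cols x 8 bytes, and intermediate copies multiply that.

```r
## One copy of a 3,000,000 x 15 double matrix:
rows <- 3e6
cols <- 15
bytes_per_double <- 8
gib <- rows * cols * bytes_per_double / 2^30
round(gib, 2)  # roughly a third of a GiB per copy; scaling and
               # dissimilarity computations make further copies
```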

#
    Nestor> I'm trying to estimate clusters from a
    Nestor> very large dataset using clara but the program stops
    Nestor> with a memory error. The (very simple) code and the
    Nestor> error:

    Nestor> mydata<-read.dbf(file="fnorsel_4px.dbf")
    Nestor> my.clara.7k<-clara(mydata,k=7)

    >> Error: cannot allocate vector of size 465108 Kb

    Nestor> The dataset contains >3,000,000 rows and 15
    Nestor> columns. I'm using a windows computer with 1.5G RAM;
    Nestor> I also tried changing the memory limit to the
    Nestor> maximum possible (4000M) Is there a way to calculate
    Nestor> clara clusters from such large datasets?

One way to start is reading the help (?clara) more carefully,
and hence using

    clara(mydata, k=7, keep.data = FALSE)
		     ^^^^^^^^^^^^^^^^^^^
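A runnable sketch of that call on synthetic stand-in data; keep.data is a documented clara() argument that stops the fitted object from carrying a copy of the input around.

```r
library(cluster)

set.seed(1)
mydata <- as.data.frame(matrix(rnorm(5000 * 15), ncol = 15))

## keep.data = FALSE: the returned object omits the 'data' component,
## so the fit does not duplicate the (potentially huge) input in memory.
fit <- clara(mydata, k = 7, keep.data = FALSE)

is.null(fit$data)        # no embedded copy of mydata
length(fit$clustering)   # cluster labels are still all there
```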

But that might not be enough:
you may need a 64-bit CPU and an operating system (with system
libraries and an R version) that uses 64-bit addressing, i.e.,
not any current version of M$ Windows.

    Nestor> Thanks a lot.

You're welcome.

Martin Maechler, ETH Zurich
#
On Wed, 3 Aug 2005, Prof Brian Ripley wrote:

Correction, sorry: the default 'sampsize' is min(n, 40 + 2*k), which in your case is 54, not nrow(x).
That might be it, but a traceback() showing where the error is occurring 
would help.  Another possible place is in the initial manipulations 
scaling the data matrix.

Since sub-sampling is used, you can start with a much smaller subset of 
the data.
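Following that suggestion, a hedged sketch: fit on a random subset of rows first, to check that the call goes through and to see where time and memory are spent. The data and the 10% fraction here are illustrative.

```r
library(cluster)

set.seed(1)
full <- as.data.frame(matrix(rnorm(20000 * 15), ncol = 15))  # stand-in

## Try the whole pipeline on, say, 10% of the rows first
idx <- sample(nrow(full), size = nrow(full) %/% 10)
fit <- clara(full[idx, ], k = 7)

table(fit$clustering)  # cluster sizes on the subset
```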