aggregate() runs out of memory
Hi,
On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <sds at gnu.org> wrote:
I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns). I want to get the result of

  table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x)

Alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is 24.3G, and no end in sight. Both V1 and V2 are characters (not factors). Is there anything I could do to speed this up? Thanks.
You might find you'll get a lot of mileage out of data.table when working with such large data.frames ... To get something close to what you're after, you can try:

R> library(data.table)
R> Z <- as.data.table(Z)
R> setkeyv(Z, 'V2')
R> agg <- Z[, list(count = .N), by = 'V2']
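(A small aside, just from my recollection, so double-check: the setkeyv() step isn't strictly required for the grouping itself -- `by` works on an unkeyed data.table too, the key mainly buys you speed -- so you could also try the grouping directly:)

R> agg <- Z[, list(count = .N), by = 'V2']  # grouping works without a key as well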
From here you might try:

R> tab1 <- table(agg$count)

I think that'll get you where you want to be ... I'm ashamed to say that I haven't really done much with aggregate, since I've mostly used plyr and data.table for this kind of thing, so I might be missing your end goal -- a reproducible example with a small data.frame from you would help here (for me at least).

HTH,
-steve
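P.S. In case it helps to see the two approaches side by side, here is a tiny made-up example (the V1/V2 columns just mimic your description, so treat it as a sketch rather than a benchmark):

R> Z <- data.frame(V1 = letters[1:6],
+                  V2 = c("a", "a", "b", "b", "b", "c"),
+                  stringsAsFactors = FALSE)
R> table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x)  # group sizes are 2, 3, 1
R> library(data.table)
R> agg <- as.data.table(Z)[, list(count = .N), by = 'V2']
R> table(agg$count)                                              # should match the line above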
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact