aggregate() runs out of memory

5 messages · Sam Steingold, Steve Lianoglou, William Dunlap, Dennis Murphy

#
I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 columns).
I want to get the result of
table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
Alas, aggregate has been running for ~30 minutes; RSS is 14G, VIRT is
24.3G, and there is no end in sight.
Both V1 and V2 are character vectors (not factors).
Is there anything I could do to speed this up?
Thanks.
#
Hi,
On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold <sds at gnu.org> wrote:
You might find you'll get a lot of mileage out of data.table when
working with such large data.frames ...

To get something close to what you're after, you can try:

R> library(data.table)
R> Z <- as.data.table(Z)                  # convert the data.frame (makes a copy)
R> setkeyv(Z, 'V2')                       # sort/index by the grouping column
R> agg <- Z[, list(count=.N), by='V2']    # .N = number of rows in each V2 group
R> tab1 <- table(agg$count)               # distribution of the group sizes
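(Side note: setting the key sorts Z by V2 up front; grouping doesn't
strictly require it -- an unkeyed `Z[, .N, by='V2']` gives the same
counts -- but keyed grouping tends to be faster on data this size.)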

I think that'll get you where you want to be ... I'm ashamed to say
that I haven't really done much w/ aggregate, since I've mostly used
plyr- and data.table-style tools, so I might be missing your end goal --
a reproducible example with a small data.frame from you would help
here (for me at least).

HTH,
-steve
#
Using data.table will probably speed lots of things up, but also note that
  aggregate(x, FUN=length, by)$x
is a slow way to compute
  table(by).
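Concretely, for the original question that means the whole pipeline
collapses to a nested table() call (a minimal sketch, assuming Z$V2
from the original post; NA handling can differ slightly between
aggregate and table):

  # counts per V2 value, then the distribution of those counts
  tab1 <- table(table(Z$V2))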

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Hi,
On Fri, Sep 14, 2012 at 4:26 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Well done, sir! (slight critique in that .N isn't a function, it's
> just a variable that is reset within each by-subset/group)
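
(Illustrating that point, a minimal sketch: inside j, .N just
evaluates to the current group's row count.)

  library(data.table)
  DT <- data.table(id = c('a', 'a', 'b'))
  DT[, list(count = .N), by = 'id']   # 'a' -> 2 rows, 'b' -> 1 row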

Also, don't forget to use the .SDcols argument of [.data.table if you
plan on only using a subset of the columns inside your "by" stuff.
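
For example (a minimal sketch, assuming you only need V1 out of the
17 columns):

  # .SD is limited to V1, so the other 16 columns are never carried per group
  agg <- Z[, lapply(.SD, length), by = 'V2', .SDcols = 'V1']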

There's lots of documentation in the package (`?data.table`) and in
the vignettes/FAQ to help you tweak your usage, if you decide to take
the data.table route.

HTH,
-steve