Using plyr::dply more (memory) efficiently?
"Steve Lianoglou" <mailinglist.honeypot at gmail.com> wrote in message news:t2ybbdc7ed01004290812n433515b5vb15b49c170f5a353 at mail.gmail.com...
Thanks for directing me to the data.table package. I read through some of the vignettes, and it looks quite nice. While your sample code would provide an answer if I just wanted to compute some summary statistic/function of groups of my data.frame (using `by=symbol`), what's the best way to produce several pieces of info per subset? For instance, I see that I can do something like this:
summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]
Yes, that's it.
But what if I need to do some more complex processing within the
subsets defined in `by=symbol` -- like several lines of programming
logic for 1 result, say.
I guess I can open a new block that just returns a data.table? Like:
summaries[, {
  cnts <- sum(counts)
  ew <- sum(exon.width)
  # ... some complex things
  complex <- # .. result of complex things
  data.table(counts=cnts, width=ew, cplx=complex)
}, by=symbol]
Is that right? (I mean, it looks like it's working, but maybe there's
a more idiomatic way(?))
Yes, you got it. Rather than a data.table at the end though, just return a
list; it's faster.
Shorter vectors will still be recycled to match any longer ones.
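To make the recycling concrete, here is a minimal sketch on made-up data. The toy table and the `total` and `width.range` names are only for illustration; the column names follow the thread. It sticks to recycling a length-1 value, which is the safe case; other length mismatches may warn or fail depending on the data.table version.

```r
library(data.table)

# Hypothetical toy data; only the column names come from the thread.
summaries <- data.table(
  symbol     = c("A", "A", "B", "B", "B"),
  counts     = c(1, 2, 3, 4, 5),
  exon.width = c(10, 20, 30, 40, 50)
)

# Each group returns a length-1 total next to a length-2 range.
# The total is recycled so that both rows of the group carry it.
res <- summaries[, {
  total <- sum(counts)
  list(total = total, width.range = range(exon.width))
}, by = symbol]

print(res)
```

Each symbol ends up with two rows, one per element of `range()`, and the group total is repeated on both.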
Or just this:
summaries[, list(
  counts = sum(counts),
  width = sum(exon.width),
  cplx = # .. result of complex things
), by=symbol]
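For what it's worth, the two forms should give the same table. A small sketch to check that on made-up data follows; the ratio used for `cplx` is only a hypothetical stand-in for the "complex things", and the toy values are invented.

```r
library(data.table)

# Hypothetical toy data; only the column names come from the thread.
summaries <- data.table(
  symbol     = c("A", "A", "B", "B", "B"),
  counts     = c(1, 2, 3, 4, 5),
  exon.width = c(10, 20, 30, 40, 50)
)

# Block form: intermediate variables, then a plain list as the last value.
block.form <- summaries[, {
  cnts <- sum(counts)
  ew   <- sum(exon.width)
  cplx <- cnts / ew  # stand-in for the complex step (an assumption)
  list(counts = cnts, width = ew, cplx = cplx)
}, by = symbol]

# Inline form: everything written directly inside list().
inline.form <- summaries[, list(
  counts = sum(counts),
  width  = sum(exon.width),
  cplx   = sum(counts) / sum(exon.width)
), by = symbol]

# Both forms should agree.
stopifnot(isTRUE(all.equal(block.form, inline.form)))
```

The block form is handy once the intermediate results are reused or take several lines to compute; the inline form is shorter when each column is a one-liner.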
Sounds like it's working, but could you give us an idea whether it is quick
and memory efficient?
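One rough way to get a feel for that, sketched here on synthetic data since I don't have your table: time the grouped call with `system.time()` and compare object sizes with `object.size()`. The row count, number of symbols and value distributions below are all assumptions, so the numbers will only be indicative.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size; scale up to match the real data

# Synthetic stand-in for the real table (all values are made up).
summaries <- data.table(
  symbol     = sample(sprintf("gene%04d", 1:2000), n, replace = TRUE),
  counts     = rpois(n, 5),
  exon.width = sample(50:500, n, replace = TRUE)
)

# Wall-clock time of the grouped summary.
timing <- system.time(
  res <- summaries[, list(counts = sum(counts), width = sum(exon.width)),
                   by = symbol]
)
print(timing["elapsed"])

# Size of the input and of the result.
print(object.size(summaries), units = "Mb")
print(object.size(res), units = "Mb")
```

Note that `object.size()` only reports the size of the finished objects, not the peak memory used while grouping, so treat it as a lower bound.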