aggregate function - na.action
On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
Looking at the timings for each stage may help:
> system.time(dt <- data.table(dat))
   user  system elapsed
   1.20    0.28    1.48
> system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))   # sort by the 8 columns (one-off)
   user  system elapsed
   4.72    0.94    5.67
> system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
   user  system elapsed
   2.00    0.21    2.20   # compared to 11.07s
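For reference, the ~11.07s figure being compared against is presumably a base-R aggregate() call along these lines (per the thread's subject, na.action matters here). The data frame below is a tiny made-up stand-in with only two key columns, since the real dat is not shown:

```r
# Tiny made-up stand-in for the thread's 'dat' (real contents not shown):
dat <- data.frame(x1 = c("a", "a", "b", "b"),
                  x2 = c("p", "p", "p", "q"),
                  y  = c(1, 2, NA, 4))

# Base-R grouped sum via aggregate(). na.action = na.pass keeps rows whose
# y is NA, so sum(..., na.rm = TRUE) can handle the NAs itself; the default
# na.action would silently drop those rows first.
res <- aggregate(y ~ x1 + x2, data = dat, FUN = sum,
                 na.rm = TRUE, na.action = na.pass)
```

With the default na.action the (b, p) group would vanish entirely; with na.pass it survives with a sum of 0.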
data.table doesn't have a custom data structure, so it can't be that. data.table's structure is the same as data.frame's, i.e. a list of vectors. data.table inherits from data.frame. It *is* a data.frame, too. The reasons it is faster in this example include:

1. Memory is only allocated for the largest group.
2. That memory is re-used for each group.
3. Since the data is ordered contiguously in RAM, the memory is copied over in bulk for each group using memcpy in C, which is faster than a for loop in C. Page fetches are expensive; they are minimised.
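The contiguous-groups point can be illustrated in plain R (data.table's actual implementation is C using memcpy, but the access pattern is the same): once rows are sorted by the key, each group occupies one contiguous block, so a single ordered pass suffices. This is a sketch over made-up, already key-sorted data, not data.table's code:

```r
# Made-up data, already sorted by the grouping key g,
# so each group is a contiguous block of rows:
g <- c("a", "a", "b", "b", "b", "c")
y <- c(1, 2, 3, 4, NA, 6)

# One pass: a group id that increments at each block boundary
# (duplicated() is enough here only because g is sorted),
# then sum each contiguous block; na.rm mirrors the thread's call:
id   <- cumsum(!duplicated(g))
sums <- vapply(split(y, id), sum, numeric(1), na.rm = TRUE)
unname(sums)   # 3 7 6
```

The win in C comes from the blocks being contiguous: each group can be copied or scanned in bulk rather than gathered row-by-row from scattered pages.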
But this is exactly what I mean by a custom data structure: you're not using the usual data frame API. Wouldn't it be better to implement these changes in data frame so that everyone can benefit? Or is it just too specialised to this particular case (where I guess you're relying on the return data structure of the summary function being consistent)?

Hadley
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/