aggregate slow with many rows - alternative?

3 messages · Hans-Peter, Gabor Grothendieck, Frank E Harrell Jr

#
Hi,

I use the code below to aggregate and count my test data. It works fine
there, but with my real data (33'000 rows) it is really slow (it had not
finished after half an hour).

Does anybody know of other functions that I could use?

Thanks,
Hans-Peter

--------------
dat <- data.frame(
  Datum     = c( 32586, 32587, 32587, 32625, 32656, 32656, 32656, 32672, 32672, 32699 ),
  FischerID = c( 58395, 58395, 58395, 88434, 89953, 89953, 89953, 64395, 62896, 62870 ),
  Anzahl    = c( 2, 2, 1, 1, 2, 1, 7, 1, 1, 2 ) )
f <- function(x) data.frame( Datum = x[1,1], FischerID = x[1,2],
                             Anzahl = sum( x[,3] ), Cnt = dim( x )[1] )
t.a <- do.call("rbind", by(dat, dat[,1:2], f))   # slow for 33'000 rows
t.a <- t.a[order( t.a[,1], t.a[,2] ),]

  # show data
dat
t.a
#
Convert dat to a matrix and see if working with the
matrix instead of a data frame speeds things up
enough.
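
A minimal sketch of the matrix idea in base R, for illustration only. It assumes the three columns are all numeric (as in the test data) and swaps by() for rowsum() on a pasted grouping key. The names m, key, sums, first and res are made up for this example, and whether it is fast enough on the real 33'000 rows is untested.

```r
dat <- data.frame(
  Datum     = c(32586, 32587, 32587, 32625, 32656, 32656, 32656, 32672, 32672, 32699),
  FischerID = c(58395, 58395, 58395, 88434, 89953, 89953, 89953, 64395, 62896, 62870),
  Anzahl    = c(2, 2, 1, 1, 2, 1, 7, 1, 1, 2)
)

# Work on a numeric matrix instead of the data frame
m <- as.matrix(dat)

# One character key per (Datum, FischerID) combination
key <- paste(m[, "Datum"], m[, "FischerID"], sep = "_")

# rowsum() adds up all rows sharing a key; the column of 1s yields the group size
sums <- rowsum(cbind(Anzahl = m[, "Anzahl"], Cnt = 1), group = key)

# Recover the grouping values from the first row of each group
first <- !duplicated(key)
res <- data.frame(Datum     = m[first, "Datum"],
                  FischerID = m[first, "FischerID"],
                  sums[key[first], , drop = FALSE],
                  row.names = NULL)

# Same ordering as t.a in the original post
res <- res[order(res$Datum, res$FischerID), ]
res
```

The point of the sketch is that no small data frame is built per group; the grouping is done once on a character vector and the sums are computed in a single call.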
On 10/13/05, Hans-Peter <gchappi at gmail.com> wrote:
#
Gabor Grothendieck wrote:
In the Hmisc package the asNumericMatrix and matrix2dataFrame functions
facilitate this.

Also look at the summarize and mApply functions in Hmisc, which can be 
quite fast.
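
A rough sketch of how summarize might be applied to the test data from the first message, assuming Hmisc is installed. The helper f2 and the result name t.b are invented for this example, and the call has not been timed on the real data.

```r
library(Hmisc)  # assumption: the Hmisc package is installed

dat <- data.frame(
  Datum     = c(32586, 32587, 32587, 32625, 32656, 32656, 32656, 32672, 32672, 32699),
  FischerID = c(58395, 58395, 58395, 88434, 89953, 89953, 89953, 64395, 62896, 62870),
  Anzahl    = c(2, 2, 1, 1, 2, 1, 7, 1, 1, 2)
)

# Hypothetical helper: total and number of records for one group
f2 <- function(x) c(Anzahl = sum(x), Cnt = length(x))

# summarize() applies f2 to Anzahl within each (Datum, FischerID) combination;
# llist() keeps the variable names as labels for the grouping columns
t.b <- with(dat, summarize(Anzahl, llist(Datum, FischerID), f2,
                           stat.name = "Anzahl"))
t.b
```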

Frank Harrell