Message-ID: <a1a6c223-cb75-4f0c-9d39-3062fcc190bf@t8g2000prh.googlegroups.com>
Date: 2011-01-26T10:39:37Z
From: analyst41 at hotmail.com
Subject: how to speed up "aggregate"

I have

> df
   quantity branch client       date  name
1        10      1      1 2010-01-01   one
2        20      2      1 2010-01-01   one
3        30      3      2 2010-01-01   two
4        15      4      1 2010-01-01   one
5        10      5      2 2010-01-01   two
6        20      6      3 2010-01-01 three
7      1000      1      1 2011-01-01   one
8      2000      2      1 2011-01-01   one
9      3000      3      2 2011-01-01   two
10     1500      4      1 2011-01-01   one
11     1000      5      2 2011-01-01   two
12     2000      6      3 2011-01-01 three
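
For anyone who wants to reproduce this, the printed data frame can be rebuilt roughly as follows. This is only a sketch: the printout does not show the column classes, so storing date as Date and name as character is an assumption.

```r
# Rebuild the example data frame shown above.
# Assumption: date is a Date and name is character; the printout
# does not reveal the actual column classes.
df <- data.frame(
  quantity = c(10, 20, 30, 15, 10, 20, 1000, 2000, 3000, 1500, 1000, 2000),
  branch   = rep(1:6, times = 2),
  client   = rep(c(1, 1, 2, 1, 2, 3), times = 2),
  date     = as.Date(rep(c("2010-01-01", "2011-01-01"), each = 6)),
  name     = rep(c("one", "one", "two", "one", "two", "three"), times = 2),
  stringsAsFactors = FALSE
)
```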

I want to aggregate away the branch column. Following a suggestion by
Gabor (thanks), I did:

> aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date),sum)
  client       date quantity
1      1 2010-01-01       45
2      2 2010-01-01       40
3      3 2010-01-01       20
4      1 2011-01-01     4500
5      2 2011-01-01     4000
6      3 2011-01-01     2000

I also want df$name in the output, so I did what looked obvious:

> aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date,name=df$name),sum)
  client       date  name quantity
1      1 2010-01-01   one       45
2      1 2011-01-01   one     4500
3      3 2010-01-01 three       20
4      3 2011-01-01 three     2000
5      2 2010-01-01   two       40
6      2 2011-01-01   two     4000

It seems to work, but it slows down tremendously for a data frame with
around 1000 rows.
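
One way to put numbers on the slowdown is system.time(). The sketch below uses synthetic data of about that size; the data frame `big`, its sizes, and its values are all made up for illustration. It also times a possible workaround that assumes name is fully determined by client (true in the sample above, not confirmed for the real data): aggregate on client and date only, then attach name from a lookup table with merge(). Whether this is actually faster will depend on the R version and the data.

```r
# Synthetic data of roughly the size mentioned (about 1000 rows).
# Everything here is hypothetical and only serves the timing comparison.
set.seed(1)
n <- 1000
client <- sample(1:200, n, replace = TRUE)
big <- data.frame(
  quantity = sample(1:100, n, replace = TRUE),
  client   = client,
  date     = as.Date("2010-01-01") + sample(0:364, n, replace = TRUE),
  name     = paste0("client_", client),  # assumption: one name per client
  stringsAsFactors = FALSE
)

# Time the three-key aggregate used above.
t_three <- system.time(
  a3 <- aggregate(list(quantity = big$quantity),
                  list(client = big$client, date = big$date, name = big$name),
                  sum)
)

# Possible workaround: group by client and date only, then look up name.
t_two <- system.time({
  a2 <- aggregate(list(quantity = big$quantity),
                  list(client = big$client, date = big$date),
                  sum)
  lookup <- unique(big[, c("client", "name")])  # one row per client
  a2 <- merge(a2, lookup, by = "client")
})

print(t_three)
print(t_two)
```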

Could anyone explain what is going on and suggest a way out?

Thanks.