Message-ID: <f8e6ff050904060756h3347eae9w863894f04906a918@mail.gmail.com>
Date: 2009-04-06T14:56:05Z
From: Hadley Wickham
Subject: SUM,COUNT,AVG
In-Reply-To: <8b356f880904060734l4125954yfc205cf7d2598f11@mail.gmail.com>

On Mon, Apr 6, 2009 at 9:34 AM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> There are various ways to do this in R.
>
> # sample data
> dd <- data.frame(a=1:10,b=sample(3,10,replace=T),c=sample(3,10,replace=T))
>
> Using the standard built-in functions, you can use:
>
> *** aggregate ***
>
> aggregate(dd,list(b=dd$b,c=dd$c),sum)
>   b c  a b c
> 1 1 1 10 2 2
> 2 2 1  3 2 1
> ....
>
> *** tapply ***
>
> tapply(dd$a,interaction(dd$b,dd$c),sum)
>       1.1       2.1       3.1       1.2       2.2       3.2       1.3       2.3
>  5.000000  3.000000 10.000000  5.000000        NA        NA  5.000000
> ...
>
> But the nicest way is probably to use the plyr package:
>
>> library(plyr)
>> ddply(dd,~b+c,sum)
>   b c V1
> 1 1 1 14
> 2 2 1  6
> ....
>
> ********
>
> Unfortunately, none of these approaches allows you to return more than one
> result from the function, so you'll need to write
>
>> ddply(dd,~b+c,length)   # count
>> ddply(dd,~b+c,sum)
>> ddply(dd,~b+c,mean)     # arithmetic average
>
> There is an 'each' function in plyr, but it doesn't seem to be compatible
> with ddply.

That's because ddply applies the function to the whole data frame, not
just the columns that aren't participating in the split.  One way
around it is:

ddply(dd, ~ b + c, function(df) each(length, sum, mean)(df$a))

I haven't figured out a more elegant way to specify this yet.
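
For anyone who wants to try this end to end, here is a minimal sketch of the
workaround above. It assumes plyr is installed; the set.seed() call and the
names `stats` and `res` are only illustrative additions so that the sample
data is reproducible and the pieces can be inspected separately.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed so the random sample data is reproducible
dd <- data.frame(a = 1:10,
                 b = sample(3, 10, replace = TRUE),
                 c = sample(3, 10, replace = TRUE))

# each() combines several functions into one that returns a named vector
stats <- each(length, sum, mean)
stats(dd$a)

# Apply the combined function to column a only, within each (b, c) group,
# so the grouping columns themselves are not fed into length/sum/mean
res <- ddply(dd, ~ b + c, function(df) stats(df$a))
res
```

The result should have one row per (b, c) combination that occurs in the
data, with one column per summary function.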

Hadley

-- 
http://had.co.nz/