How to apply a function to subsets of a data frame *and* obtain a data frame again?

8 messages · Marius Hofert, Nick Sabbe, Paul Hiemstra +3 more

#
Dear all,

First, let's create some data to play around with:

set.seed(1)
(df <- data.frame(Group=rep(c("Group1","Group2","Group3"), each=10), 
                 Value=c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30,30),])

## Now we need the empirical distribution function:
edf <- function(x) ecdf(x)(x) # empirical distribution function evaluated at x

## The big question is how one can apply the empirical distribution function to 
## each subset of df determined by "Group", so how to apply it to Group1, then
## to Group2, and finally to Group3. You might suggest (?) to use tapply:

(edf. <- tapply(df$Value, df$Group, FUN=edf))

## That's correct. But typically, one would like to obtain not only the values, 
## but a data.frame containing the original information and the new (edf-)values.
## What's a simple way to get this? (one would be required to first sort df 
## according to Group, then paste the values computed by edf to the sorted df; 
## seems a bit tedious). 
## A solution I have is the following (but I would like to know if there is a 
## simpler one):

(edf.. <- do.call("rbind", lapply(unique(df$Group), function(strg){
    subdata <- subset(df, Group==strg) # rows belonging to this group
    cbind(subdata, edf=edf(subdata$Value)) # return the sub-data with its edf column
})))
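
For comparison, the sort-then-paste route described in the comments above could be sketched in base R as follows. This is only a sketch: the name df_sorted is made up for illustration, and it relies on order() and tapply() both arranging the groups in the same sorted order.

```r
set.seed(1)
df <- data.frame(Group = rep(c("Group1", "Group2", "Group3"), each = 10),
                 Value = c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30, 30), ]
edf <- function(x) ecdf(x)(x)

# Sort the rows by Group so they line up with the group order used by tapply()
df_sorted <- df[order(df$Group), ]  # df_sorted is a hypothetical name

# tapply() returns one vector per group; unlist() concatenates them in group order
df_sorted$edf <- unlist(tapply(df_sorted$Value, df_sorted$Group, FUN = edf),
                        use.names = FALSE)
df_sorted
```

The result has the same columns as edf.. but needs no explicit loop over the groups; the price is that the original row order is lost.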


Cheers,

Marius
#
You might want to look at package plyr and use ddply.

HTH,


Nick Sabbe
--
ping: nick.sabbe at ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36

-- Do Not Disapprove
#
On 08/17/2011 11:24 AM, Nick Sabbe wrote:
The following example does what you want using ddply:

library(plyr)
edfPerGroup = ddply(df, .(Group), summarise, edf = edf(Value), Value = Value)
    Group edf       Value
1  Group1 0.5 0.539682840
2  Group1 0.2 0.145706727
3  Group1 0.7 0.956567494
4  Group1 0.3 0.147045991
5  Group1 0.9 1.229562053
6  Group1 0.4 0.436068626
7  Group1 0.8 1.181642779
8  Group1 0.1 0.139795262
9  Group1 1.0 2.894968537
10 Group1 0.6 0.755181833

cheers,
Paul

  
    
#
Or slightly more succinctly:

ddply(df, .(Group), mutate, edf = edf(Value))

Hadley
#
Dear all, 

thanks a lot for the quick help. 
Below is what I built using Nick's hint.

Cheers,

Marius


library(plyr)

set.seed(1)
(df <- data.frame(Group=rep(c("Group1","Group2","Group3"), each=10), 
                Value=c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30,30),])
edf <- function(x) ecdf(x)(x) 

ddply(df, .(Group), function(df.) cbind(df., edf=edf(df.$Value)))
On 2011-08-17, at 13:38 , Hadley Wickham wrote:

            
#
On 08/17/2011 11:51 AM, Marius Hofert wrote:
Hadley's code is much shorter, I would use that syntax.

cheers,
Paul

  
    
#
Have a look at function ave(), e.g.,

set.seed(1)
(df <- data.frame(Group=rep(c("Group1","Group2","Group3"), each=10),
     Value=c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30,30),])

edf <- function(x) ecdf(x)(x)
df$edf <- with(df, ave(Value, Group, FUN = edf))
df
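
As a rough illustration of why ave() keeps the rows in their original order, the same computation can be sketched with split() and unsplit(), which cut the values into groups and then put the per-group results back into their original positions. The names pieces and edf_split are made up for this sketch.

```r
set.seed(1)
df <- data.frame(Group = rep(c("Group1", "Group2", "Group3"), each = 10),
                 Value = c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30, 30), ]
edf <- function(x) ecdf(x)(x)

# ave() version from above
df$edf <- with(df, ave(Value, Group, FUN = edf))

# Hand-rolled version: split Value by Group, apply edf to each piece,
# then let unsplit() restore the original row positions
pieces <- split(df$Value, df$Group)                 # hypothetical name
edf_split <- unsplit(lapply(pieces, edf), df$Group) # hypothetical name

all.equal(edf_split, df$edf)
```

Because the results are written back by position, no sorting or rbind step is needed, and df stays in its shuffled order.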


I hope it helps.

Best,
Dimitris
On 8/17/2011 12:42 PM, Marius Hofert wrote:

  
    
#
Hi:

I would agree with Paul Hiemstra about using Hadley's code instead;
see ?plyr:::mutate for details. It would also make sense to sort the
result by group, and by edf within each group - this does it in one line:

arrange(ddply(df, .(Group), mutate, edf = edf(Value)), Group, edf)
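
If plyr is not available, a base R sketch of the same idea would combine ave() with order(). The name df_sorted is made up for illustration, and this is meant as an approximate equivalent rather than a drop-in replacement for arrange().

```r
set.seed(1)
df <- data.frame(Group = rep(c("Group1", "Group2", "Group3"), each = 10),
                 Value = c(rexp(10, 1), rexp(10, 4), rexp(10, 10)))[sample(1:30, 30), ]
edf <- function(x) ecdf(x)(x)

# Add the per-group edf values in the original row order
df$edf <- ave(df$Value, df$Group, FUN = edf)

# Sort by Group first, then by edf within each group
df_sorted <- df[order(df$Group, df$edf), ]  # df_sorted is a hypothetical name
head(df_sorted)
```

Unlike arrange(), indexing with order() keeps the original row names, so they will appear shuffled in the printed result.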

HTH,
Dennis
On Wed, Aug 17, 2011 at 4:51 AM, Marius Hofert <m_hofert at web.de> wrote: