Skip to content

Subsetting for the ten highest values by group in a dataframe

4 messages · Sam Albers, Phil Spector, Hadley Wickham +1 more

#
Hello,

I am looking for a way to subset a data frame by choosing the top ten
maximum values from that dataframe. As well this occurs within some
factor levels.

## I've used plyr here but I'm not married to this approach
require(plyr)

## I've created a data.frame with two groups and then a id variable (y)
df <- data.frame(x=rnorm(400, mean=20), y=1:400, z=c("A","B"))

## So using ddply I can find the highest value of x
df.max1 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1])

## Or the 2nd highest value
df.max2 <- ddply(df, c("z"), subset, x==sort(x, TRUE)[2])

## And so on.... but when I try to make a series of numbers like so
## to get the top ten values, I don't get a warning message but
## two values that don't really make sense to me
df.max <- ddply(df, c("z"), subset, x==sort(x, TRUE)[1:10])

## So no error message when I use the method above, which is clearly wrong.
## But I really am not sure how to diagnose the problem.

## Can anyone suggest a way to subset a data.frame with groups to
select the top ten max values in that data.frame for each group?

## Thanks so much in advance?

Sam
#
Sam -
    I think that subset is what's throwing you off here --
you need a function that will simply return the 10 rows of
each group with the highest values of x:

function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],]

Then

ddply(df,'z',function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],])

should give you what you want.  In this simple case, you could
also use

do.call(rbind,by(df,df$z,function(dat)dat[order(dat$x,decreasing=TRUE)[1:10],]))

from base R to get the same result.

Hope this helps.
 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu
On Fri, 27 Jan 2012, Sam Albers wrote:

            
#
On Fri, Jan 27, 2012 at 1:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:
Well, sort returns a vector, so you probably want

df.max <- ddply(df, c("z"), subset, x %in% sort(x, TRUE)[1:10])

but it would be better to do (e.g.)

df.max <- ddply(df, c("z"), subset, rank(x) <= 10)

which will also make it possible to deal with ties in a principled way.

Hadley
#
On Fri, Jan 27, 2012 at 2:26 PM, Sam Albers <tonightsthenight at gmail.com> wrote:
(1) Here is a pure R solution.  The result of ave is a vector with the
rank within group of each row so we just take the ones that are 10 or
less:

df.top <- df[ave(-df$x, df$z, FUN = rank) <= 10, ]

We may or may not wish to use the ties.method= argument to rank
depending on how ties are to be handled.  The question did not
indicate that the result needed to be sorted but if it does then
df.top can be sorted like this:

df.top[with(df.top, order(z, -x)), ]

(2) Here is an sqldf solution that uses the default SQLite database:

library(sqldf)
sqldf("select * from df i
   where rowid in
    (select rowid from df where z = i.z order by x desc limit 10)
   order by i.z, i.x desc")

This always gives 10 in each group even if there are ties.  We can
omit the last order by clause if the result need not be sorted.

(3) and here is an sqldf using the PostgreSQL database (which has
additional simplifying features):

library(sqldf)
library(RPostgreSQL)

sqldf("select * from
	(select *, row_number() over  (partition by z order by x desc) rank from df) s
	where rank <= 10
	order by z, rank")

Depending on how we wish to handle ties we may need to replace
row_number() with rank().  Also, we can omit the last order by clause
if the result need not be sorted.