Skip to content

Extracting data from dataframe with tied rows

5 messages · rjb, Peter Alspach, William Dunlap +1 more

rjb
#
Hi R help,

I'm a fairly experienced R user but this manipulation has me stumped, please
help:

DATA
id<-rep(LETTERS[1:5],20)
distance<-rnorm(1:100, mean = 100)
bearing<-sample(1:360,100,replace=T)
month<-sample(1:12,100,replace=T)

I have a dataset with records of individuals (id) , each with a distance
(distance) & direction (bearing) recorded for each month (month).
I want to find the largest distance per individual per month, which is easy
with /tapply/ or /melt/cast (reshape)/,
head(DATA_m<-melt(DATA,id=c("id","month")))
cast(DATA_m,id+month~.,max)
OR
na.omit(melt(tapply(distance,list(id,month),max)))

*BUT THE CATCH IS* ,
I also want the the *corresponding*  bearing for that maximum distance per
month. I've tried the steps above plus using which.max() and loops, but
can't solve the problem. The real dataset is about 6000 rows.

I'm guessing the answer is in finding the row number from the original DATA
but I can't figure how to do that with tapply or melt.

Any suggestions would be greatly appreciated.

John Burnside





--
View this message in context: http://r.789695.n4.nabble.com/Extracting-data-from-dataframe-with-tied-rows-tp4641140.html
Sent from the R help mailing list archive at Nabble.com.
#
Tena koe John

One way:

johnData <- data.frame(id=rep(LETTERS[1:5],20), distance=rnorm(1:100, mean = 100), bearing=sample(1:360,100,replace=T), month=sample(1:12,100,replace=T))
johnAgg <- aggregate(johnData[,'distance'], johnData[,c('id','month')], max)
names(johnAgg)[3] <- 'distance'
merge(johnAgg, johnData)

HTH ....

Peter Alspach

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of rjb
Sent: Friday, 24 August 2012 9:19 a.m.
To: r-help at r-project.org
Subject: [R] Extracting data from dataframe with tied rows

Hi R help,

I'm a fairly experienced R user but this manipulation has me stumped, please
help:

DATA
id<-rep(LETTERS[1:5],20)
distance<-rnorm(1:100, mean = 100)
bearing<-sample(1:360,100,replace=T)
month<-sample(1:12,100,replace=T)

I have a dataset with records of individuals (id) , each with a distance
(distance) & direction (bearing) recorded for each month (month).
I want to find the largest distance per individual per month, which is easy
with /tapply/ or /melt/cast (reshape)/,
head(DATA_m<-melt(DATA,id=c("id","month")))
cast(DATA_m,id+month~.,max)
OR
na.omit(melt(tapply(distance,list(id,month),max)))

*BUT THE CATCH IS* ,
I also want the the *corresponding*  bearing for that maximum distance per
month. I've tried the steps above plus using which.max() and loops, but
can't solve the problem. The real dataset is about 6000 rows.

I'm guessing the answer is in finding the row number from the original DATA
but I can't figure how to do that with tapply or melt.

Any suggestions would be greatly appreciated.

John Burnside





--
View this message in context: http://r.789695.n4.nabble.com/Extracting-data-from-dataframe-with-tied-rows-tp4641140.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

The contents of this e-mail are confidential and may be subject to legal privilege.
 If you are not the intended recipient you must not use, disseminate, distribute or
 reproduce all or any part of this e-mail or attachments.  If you have received this
 e-mail in error, please notify the sender and delete all material pertaining to this
 e-mail.  Any opinion or views expressed in this e-mail are those of the individual
 sender and may not represent those of The New Zealand Institute for Plant and
 Food Research Limited.
#
Or use ave() to compute the within-group ranks (reversed, so max has rank 1) and select
the elements whose ranks are 1:
f2 <- function (DATA) 
{
    stopifnot(is.data.frame(DATA), all(c("distance", "id", "month") %in% 
        names(DATA)))
    revRanks <- ave(DATA[["distance"]], DATA[["id"]], DATA[["month"]], 
        FUN = function(x) rank(-x, ties = "first"))
    DATA[revRanks == 1, ]
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Here's another pretty straightforward solution, using the plyr pkg:

  DF <- data.frame(id, month, distance, bearing)
    # variables as defined in the OP

  require(plyr)
  DF1<-ddply(DF, .(id,month), summarize,
        maxdist = max(distance),
        maxbearing = bearing[which.max(distance)])

Peter Ehlers
On 2012-08-24 09:54, William Dunlap wrote:
#
Another strategy is to sort by month, id, and, in reverse order,
distance and select the rows that start each month/id run.  This
can be much faster than the other ways when there are lots of
month/id combinations.

f1 <- function (DATA) 
{
    stopifnot(is.data.frame(DATA),
                      all(c("distance", "id", "month") %in% names(DATA)))
    DATA <- DATA[order(DATA$month, DATA$id, -DATA$distance), ]
    ldiff <- function(x) c(TRUE, x[-1] != x[-length(x)])
    DATA[ldiff(DATA$month) | ldiff(DATA$id), ]
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com