Given a dataframe like
> data
ID Type N
1 45900 A 1
2 45900 B 2
3 45900 C 3
4 45900 D 4
5 45900 E 5
6 45900 F 6
7 45900 I 7
8 49270 A 1
9 49270 B 2
10 49270 E 3
18 46550 A 1
19 46550 B 2
20 46550 C 3
21 46550 D 4
22 46550 E 5
23 46550 F 6
24 46550 I 7
>
containing an identifier (ID), a variable type code (Type), and
a running count of the number of records per ID (N), how can I
return a dataframe of only those records with the maximum value
of N for each ID? For instance,
> data
ID Type N
7 45900 I 7
10 49270 E 3
24 46550 I 7
I know that I can use
> tapply ( data $ N , data $ ID , max )
45900 46550 49270
7 7 3
>
to get the values of the maximum N for each ID, but how is it
that I can find the index of these values to subsequently use to
subscript data?
--
maxine-webber
by-group processing
9 messages · Max Webber, Jorge Ivan Velez, jim holtman +5 more
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090506/d3b6cd44/attachment-0001.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090506/3c86d7c1/attachment-0001.pl>
Max, Since the dataset is sorted by ID, with ties broken by N, the following should do it and do it quickly. It grabs the rows just before ID changes.
with(data, data[ c(ID[-1] != ID[-length(ID)], TRUE),, drop=FALSE])
ID Type N 7 45900 I 7 10 49270 E 3 24 46550 I 7 Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Max Webber Sent: Wednesday, May 06, 2009 3:09 PM To: r-help at r-project.org Subject: [R] by-group processing Given a dataframe like
> data
ID Type N 1 45900 A 1 2 45900 B 2 3 45900 C 3 4 45900 D 4 5 45900 E 5 6 45900 F 6 7 45900 I 7 8 49270 A 1 9 49270 B 2 10 49270 E 3 18 46550 A 1 19 46550 B 2 20 46550 C 3 21 46550 D 4 22 46550 E 5 23 46550 F 6 24 46550 I 7
>
containing an identifier (ID), a variable type code (Type), and a running count of the number of records per ID (N), how can I return a dataframe of only those records with the maximum value of N for each ID? For instance,
> data
ID Type N 7 45900 I 7 10 49270 E 3 24 46550 I 7 I know that I can use
> tapply ( data $ N , data $ ID , max )
45900 46550 49270
7 7 3
>
to get the values of the maximum N for each ID, but how is it that I can find the index of these values to subsequently use to subscript data? -- maxine-webber
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
how about: d=data[order(data$ID,-data$Type),] d[!duplicated(d$ID),]
Max Webber wrote:
Given a dataframe like
> data
ID Type N 1 45900 A 1 2 45900 B 2 3 45900 C 3 4 45900 D 4 5 45900 E 5 6 45900 F 6 7 45900 I 7 8 49270 A 1 9 49270 B 2 10 49270 E 3 18 46550 A 1 19 46550 B 2 20 46550 C 3 21 46550 D 4 22 46550 E 5 23 46550 F 6 24 46550 I 7
>
containing an identifier (ID), a variable type code (Type), and a running count of the number of records per ID (N), how can I return a dataframe of only those records with the maximum value of N for each ID? For instance,
> data
ID Type N 7 45900 I 7 10 49270 E 3 24 46550 I 7 I know that I can use
> tapply ( data $ N , data $ ID , max )
45900 46550 49270
7 7 3
>
to get the values of the maximum N for each ID, but how is it that I can find the index of these values to subsequently use to subscript data? -- maxine-webber
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
View this message in context: http://www.nabble.com/by-group-processing-tp23417208p23437592.html Sent from the R help mailing list archive at Nabble.com.
On Thursday, May 07, 2009 7:45 PM, David Freedman wrote:
> ...how about: > d=data[order(data$ID,-data$Type),] > d[!duplicated(d$ID),] Does the "-data$Type" argument to the order function work? -- David ? ----------------------------------------------------- David Huffer, Ph.D. Senior Statistician CSOSA/Washington, DC david.huffer at csosa.gov
sorry about the mistake - the -data$Type doesn't work: the '-' sign isn't valid for factors. I *thought* I had checked this before submitting a response !
HufferD wrote:
On Thursday, May 07, 2009 7:45 PM, David Freedman wrote:
> ...how about: > d=data[order(data$ID,-data$Type),] > d[!duplicated(d$ID),]
Does the "-data$Type" argument to the order function work? -- David ? ----------------------------------------------------- David Huffer, Ph.D. Senior Statistician CSOSA/Washington, DC david.huffer at csosa.gov
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
View this message in context: http://www.nabble.com/by-group-processing-tp23417208p23453688.html Sent from the R help mailing list archive at Nabble.com.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090508/8dce5847/attachment-0001.pl>
On Wed, May 6, 2009 at 8:12 PM, jim holtman <jholtman at gmail.com> wrote:
Ths should do it:
do.call(rbind, lapply(split(x, x$ID), tail, 1))
? ? ? ? ID Type N 45900 45900 ? ?I 7 46550 46550 ? ?I 7 49270 49270 ? ?E 3
Or with plyr: library(plyr) ddply(x, "id", tail, 1) plyr encapsulates the common split-apply-combine strategy and takes cares of the details for you. Read more about it on http://had.co.nz/plyr Hadley