Skip to content

by-group processing

9 messages · Max Webber, Jorge Ivan Velez, jim holtman +5 more

#
Given a dataframe like

  > data
        ID Type N
  1  45900    A 1
  2  45900    B 2
  3  45900    C 3
  4  45900    D 4
  5  45900    E 5
  6  45900    F 6
  7  45900    I 7
  8  49270    A 1
  9  49270    B 2
  10 49270    E 3
  18 46550    A 1
  19 46550    B 2
  20 46550    C 3
  21 46550    D 4
  22 46550    E 5
  23 46550    F 6
  24 46550    I 7
  >

containing an identifier (ID), a variable type code (Type), and
a running count of the number of records per ID (N), how can I
return a dataframe of only those records with the maximum value
of N for each ID? For instance,

  > data
        ID Type N
  7  45900    I 7
  10 49270    E 3
  24 46550    I 7

I know that I can use

   > tapply ( data $ N , data $ ID , max )
   45900 46550 49270
       7     7     3
   >

to get the values of the maximum N for each ID, but how is it
that I can find the index of these values to subsequently use to
subscript data?


--
maxine-webber
#
Max,

Since the dataset is sorted by ID, with ties broken by N, the following
should do it and do it quickly.  It grabs the rows just before ID
changes.
ID Type N
7  45900    I 7
10 49270    E 3
24 46550    I 7

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
how about:

d=data[order(data$ID,-data$Type),]
d[!duplicated(d$ID),]
Max Webber wrote:

  
    
#
On Thursday, May 07, 2009 7:45 PM, David Freedman wrote:
> ...how about:
 >   d=data[order(data$ID,-data$Type),]
 >   d[!duplicated(d$ID),]

Does the "-data$Type" argument to the order function work?

--
 David
?
 -----------------------------------------------------
 David Huffer, Ph.D.               Senior Statistician
 CSOSA/Washington, DC           david.huffer at csosa.gov
#
sorry about the mistake - the -data$Type doesn't work: the '-' sign isn't
valid for factors.  I *thought* I had checked this before submitting a
response !
HufferD wrote:

  
    
#
On Wed, May 6, 2009 at 8:12 PM, jim holtman <jholtman at gmail.com> wrote:
Or with plyr:

library(plyr)
ddply(x, "id", tail, 1)

plyr encapsulates the common split-apply-combine strategy and takes
cares of the details for you.  Read more about it on
http://had.co.nz/plyr

Hadley