Message-ID: <XFMail.050505141700.Ted.Harding@nessie.mcc.ac.uk>
Date: 2005-05-05T13:17:00Z
From: (Ted Harding)
Subject: selecting maximum values
In-Reply-To: <Pine.LNX.4.44.0505041925580.23757-100000@reclus.nhh.no>
On 04-May-05 Roger Bivand wrote:
> On Wed, 4 May 2005, Sean Davis wrote:
>
>> see ?aggregate.
>
> Or maybe tapply, or its close relative, by:
>
> > by(df, list(df$station, df$date), function(x)
> +   x$row[which.max(x$chlorophyll)])
>: Ancona
>: 21/06/01
> [1] NA
> ------------------------------------------------------------
>: Castagneto
>: 21/06/01
> [1] 3
> ------------------------------------------------------------
>: Ancona
>: 23/06/01
> [1] 6
> ------------------------------------------------------------
>: Castagneto
>: 23/06/01
> [1] NA
>
> since happily a row ID column was included in the data frame. Note that
> which.max only reports the row of the first maximum if there are ties.
I've tried to work out a method which gives a cleaner result
(for instance, the NAs are ugly and unnecessary).
I've called Alessandro's data (below) "chl" (for chlorophyll),
and, using Roger's command above, assigned the result to "tmp":
tmp <- by(chl, list(chl$station, chl$date),
          function(x) x$row[which.max(x$chlorophyll)])
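For anyone following along, "chl" can be rebuilt from the nine rows
Alessandro posted, roughly as follows. Storing station and date as plain
character columns is an assumption here, and the further rows his post
elides with "..." are left out.

```r
# Alessandro's nine posted rows; column types are an assumption
# (station and date kept as character, row as integer).
chl <- data.frame(
  row         = 1:9,
  station     = rep(c("Castagneto", "Ancona"), times = c(4, 5)),
  date        = rep(c("21/06/01", "23/06/01"), times = c(4, 5)),
  depth       = c(-0.5, -1.5, -2.5, -3.5, -0.5, -1.5, -2.5, -3.5, -4.5),
  chlorophyll = c(2.0, 2.2, 2.4, 2.1, 2.4, 2.5, 2.2, 2.1, 1.9),
  stringsAsFactors = FALSE
)

# Quick look at the structure before applying by()
str(chl)
```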
Then, using either tmp[1:2,] or tmp[,1:2] we get
tmp[,1:2]
##            21/06/01 23/06/01
## Ancona           NA        6
## Castagneto        3       NA
which is a better layout but still has the NAs.
It would be better to be able to get something like
## Ancona      23/06/01  6
## Castagneto  21/06/01  3
but I don't see how to do it even for just these 2 stations.
Now, however, suppose we want not just the rows but the values
as well. Try a modified function
tmp <- by(chl, list(chl$station, chl$date),
          function(x) list(Row = x$row[which.max(x$chlorophyll)],
                           Val = max(x$chlorophyll)))
Now
str(tmp)
## List of 4
## $ : NULL
## $ :List of 2
## ..$ Row: int 3
## ..$ Val: num 2.4
## $ :List of 2
## ..$ Row: int 6
## ..$ Val: num 2.5
## $ : NULL
## - attr(*, "dim")= int [1:2] 2 2
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "Ancona" "Castagneto"
## ..$ : chr [1:2] "21/06/01" "23/06/01"
## - attr(*, "call")= language by.data.frame(data = chl, INDICES =
## list(chl$station, chl$date), FUN = function(x) list(Row =
## x$row[which.max(x$chlorophyll)], ...
## - attr(*, "class")= chr "by"
I've not succeeded (though experience tells me that others could)
in extracting from this something like the following:
##       Ancona    Castagneto
## Row   6         3
## Val   2.5       2.4
## Date  23/06/01  21/06/01
Questions: (a) What's the trick? (b) How to generalise it?
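
One possible answer to (a), offered as a sketch rather than a tested
recipe: sidestep by() and use split() with drop = TRUE, so that
station/date combinations with no data never appear, then keep the
which.max row of each piece. The names "pieces", "best", "long" and
"wide" are made up for illustration, and "chl" is rebuilt from the nine
posted rows only. As for (b), nothing below depends on there being just
two stations, though the wide layout assumes one date per station, and
which.max still keeps only the first maximum when there are ties.

```r
# Alessandro's nine posted rows (the "..." rows in his post are not included)
chl <- data.frame(
  row         = 1:9,
  station     = rep(c("Castagneto", "Ancona"), times = c(4, 5)),
  date        = rep(c("21/06/01", "23/06/01"), times = c(4, 5)),
  depth       = c(-0.5, -1.5, -2.5, -3.5, -0.5, -1.5, -2.5, -3.5, -4.5),
  chlorophyll = c(2.0, 2.2, 2.4, 2.1, 2.4, 2.5, 2.2, 2.1, 1.9),
  stringsAsFactors = FALSE
)

# One piece per station/date combination that actually occurs;
# drop = TRUE discards the empty combinations that gave NA/NULL under by().
pieces <- split(chl, list(chl$station, chl$date), drop = TRUE)

# From each piece keep the single row with the largest chlorophyll value
# (which.max returns the first maximum if there are ties).
best <- do.call(rbind, lapply(pieces, function(x) x[which.max(x$chlorophyll), ]))
best <- best[order(best$station, best$date), ]

# Long layout: one line per station/date.
long <- data.frame(station = best$station, date = best$date,
                   Row = best$row, Val = best$chlorophyll)
print(long)

# Wide layout: stations as columns. This is a character matrix because
# Date is text, and it assumes each station occurs on one date only.
wide <- rbind(Row = best$row, Val = best$chlorophyll, Date = best$date)
colnames(wide) <- best$station
print(wide, quote = FALSE)
```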
Ted.
>
>>
>> Sean
>>
>> On May 4, 2005, at 11:43 AM, alessandro carletti wrote:
>>
>> > Sorry for disturbing you with another newbie question!
>> > I have a data frame about coastal waters quality
>> > parameters: for some parameters (e.g. NH3) I have only
>> > 1 observation for each sampling station and each
>> > sampling date, while in other cases (chlorophyll) I
>> > have 1 obs for each meter-depth for each station and
>> > date. How can I select only the max chlorophyll value
>> > for each station/date?
>> >
>> > example
>> >
>> > row station date depth chlorophyll
>> > 1 Castagneto 21/06/01 -0.5 2.0
>> > 2 Castagneto 21/06/01 -1.5 2.2
>> > 3 Castagneto 21/06/01 -2.5 2.4
>> > 4 Castagneto 21/06/01 -3.5 2.1
>> > 5 Ancona 23/06/01 -0.5 2.4
>> > 6 Ancona 23/06/01 -1.5 2.5
>> > 7 Ancona 23/06/01 -2.5 2.2
>> > 8 Ancona 23/06/01 -3.5 2.1
>> > 9 Ancona 23/06/01 -4.5 1.9
>> > ...
>> >
>> > I'd like to select only row 3 and 6, the ones with max
>> > chlorophyll values, or have the mean for the rows 1:4
>> > and 5:9
>> >
>> > Thanks
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 05-May-05 Time: 14:13:13
------------------------------ XFMail ------------------------------