
Summary information by groups programming assistance

9 messages · Hadley Wickham, Søren Højsgaard, William Revelle +2 more

#
All - 

I have data that looks like

          psd Species  Lake Length Weight St.weight        Wr      Wr.1    vol
432  substock     SMB Clear    150  41.00      0.01  95.12438  95.10118 0.0105
433  substock     SMB Clear    152  39.00      0.01  86.72916  86.70692 0.0105
434  substock     SMB Clear    152  40.00      3.11  88.95298  82.03689 3.2655
435  substock     SMB Clear    159  48.00      0.04  92.42095  92.34393 0.0420
436  substock     SMB Clear    159  48.00      0.01  92.42095  92.40170 0.0105
437  substock     SMB Clear    165  47.00      0.03  80.38023  80.32892 0.0315
438  substock     SMB Clear    171  62.00      0.21  94.58105  94.26070 0.2205
439  substock     SMB Clear    178  70.00      0.01  93.91912  93.90571 0.0105
440  substock     SMB Clear    179  76.00      1.38 100.15760  98.33895 1.4490
441       S-Q     SMB Clear    180  75.00      0.01  97.09330  97.08035 0.0105
442       S-Q     SMB Clear    180  92.00      0.02 119.10111 119.07522 0.0210
...
[truncated] 

where psd and Lake are categorical variables, with five and four
categories, respectively.  I'd like to find the maximum vol, and the
Length associated with that maximum, for each psd category in each lake.
In other words, I'd like to have a data frame that looks something like

Lake      Category  Length  vol
Clear     substock  152     3.2655
Clear     S-Q       266     11.73
Clear     Q-P       330     14.89
...
Pickerel  substock  170     3.4965
Pickerel  S-Q       248     10.69
Pickerel  Q-P       335     25.62
Pickerel  P-M       415     32.62
Pickerel  M-T       442     17.25


To get this originally, I used

with(smb[smb$Lake == "Clear", ], tapply(vol, list(Length, psd), max))
with(smb[smb$Lake == "Enemy.Swim", ], tapply(vol, list(Length, psd), max))
with(smb[smb$Lake == "Pickerel", ], tapply(vol, list(Length, psd), max))
with(smb[smb$Lake == "Roy", ], tapply(vol, list(Length, psd), max))

and pulled the values I needed out by hand and put them into a .csv.
Unfortunately, I've got a number of other data sets upon which I'll need
to do the same analysis.  Finding a programmable alternative would
provide a much easier (and likely less error-prone) method to achieve
the same results.  Ideally, the "Length" and "vol" data would be in a
data frame such that I could then analyze with nls.  

Does anyone have any thoughts as to how I might accomplish this?  

Thanks in advance, 

Steven Ranney
#
On Mon, Dec 22, 2008 at 3:51 PM, Ranney, Steven
<steven.ranney at montana.edu> wrote:
You might want to have a look at the plyr package,
http://had.co.nz/plyr, which provides a set of tools to make tasks
like this easy.  There are a number of similar examples in the
introductory pdf that should get you started.

Regards,

Hadley
#
Maybe summaryBy (or lapplyBy/splitBy) in the doBy package might help you.
Regards
Søren

________________________________

From: r-help-bounces at r-project.org on behalf of Ranney, Steven
Sent: Mon 22-12-2008 22:51
To: r-help at r-project.org
Subject: [R] Summary information by groups programming assistance




______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Yet another suggestion is describe.by in the psych package.
At 11:25 PM +0100 12/22/08, Søren Højsgaard wrote:
#
Here are two solutions assuming DF is your data frame:

# 1. aggregate is in the base of R

aggregate(DF[c("Length", "vol")], DF[c("Lake", "psd")], max)

or the following which is the same except it labels psd as Category:

aggregate(DF[c("Length", "vol")],
          with(DF, list(Lake = Lake, Category = psd)), max)


# 2. sqldf.  The sqldf package allows specification using SQL notation:

library(sqldf)
sqldf("select Lake, psd as Category, max(Length), max(vol) from DF
group by Lake, psd")

There are many other good solutions too, using various packages which
have already been mentioned on this thread.
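
One thing to keep in mind with the aggregate call above: max is applied to
Length and to vol separately within each group, so the Length reported is
the largest Length in the group, not necessarily the Length of the fish
with the largest vol. A minimal sketch on made-up values (the numbers
below are illustrative only and are not taken from the smb data):

```r
# Toy data; values are hypothetical
DF <- data.frame(
  Lake   = c("Clear", "Clear", "Clear", "Roy", "Roy"),
  psd    = c("substock", "substock", "S-Q", "S-Q", "S-Q"),
  Length = c(152, 179, 180, 200, 240),
  vol    = c(3.2655, 1.4490, 0.0210, 5.5, 2.0)
)

# max() is computed column by column within each Lake/psd group
res <- aggregate(DF[c("Length", "vol")], DF[c("Lake", "psd")], max)
print(res)
```

For the Clear/substock group this reports Length 179 together with
vol 3.2655, although in the toy data the fish with vol 3.2655 has
Length 152.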

On Mon, Dec 22, 2008 at 4:51 PM, Ranney, Steven
<steven.ranney at montana.edu> wrote:
#
On Mon, Dec 22, 2008 at 7:15 PM, Ranney, Steven
<steven.ranney at montana.edu> wrote:
Try which.max along with any of the solutions previously mentioned.

Hadley
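
A rough base-R sketch of the which.max idea, assuming a data frame DF
with columns Lake, psd, Length and vol (the toy values below are made up
for illustration): split the data by group and keep, from each piece,
the row where vol is largest.

```r
# Toy data; values are hypothetical
DF <- data.frame(
  Lake   = c("Clear", "Clear", "Clear", "Roy", "Roy"),
  psd    = c("substock", "substock", "S-Q", "S-Q", "S-Q"),
  Length = c(152, 179, 180, 200, 240),
  vol    = c(3.2655, 1.4490, 0.0210, 5.5, 2.0)
)

# One data frame per Lake/psd combination; drop = TRUE skips empty combinations
groups <- split(DF, list(DF$Lake, DF$psd), drop = TRUE)

# From each group keep the single row with the largest vol
top <- do.call(rbind, lapply(groups, function(d) {
  d[which.max(d$vol), c("Lake", "psd", "Length", "vol")]
}))
rownames(top) <- NULL
print(top)
```

which.max returns the position of the first maximum, so if several rows
in a group tie for the largest vol, only the first of them is kept.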
#
Just sort the data first and then apply any of the solutions but with tail(x, 1)
instead of max, e.g.

DFo <- DF[order(DF$Lake, DF$Length, DF$vol), ]
aggregate(DFo[c("Length", "vol")], DFo[c("Lake", "psd")], tail, 1)


On Mon, Dec 22, 2008 at 8:15 PM, Ranney, Steven
<steven.ranney at montana.edu> wrote:
#
The sorting should have been by Lake, psd and vol (not what I had)
so it should be revised to:

DFo <- DF[order(DF$Lake, DF$psd, DF$vol), ]
aggregate(DFo[c("Length", "vol")], DFo[c("Lake", "psd")], tail, 1)

This is the same as before except DF$psd is used in place of DF$Length
in the first line.
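
Since the same analysis is needed on several data sets, the two lines
above could be wrapped in a small function. This is only a sketch: the
function name max_by_group and its arguments are hypothetical, and the
toy values are made up for illustration.

```r
# Hypothetical helper: within each combination of the `groups` columns,
# return the row with the largest `value`, keeping the `keep` columns.
max_by_group <- function(df, groups, value, keep) {
  # Sort by the grouping columns and then by `value`, so the row with the
  # largest `value` is the last row of each group
  ord <- do.call(order, unname(as.list(df[c(groups, value)])))
  dfo <- df[ord, ]
  # tail(x, 1) picks the last element of each column within each group
  aggregate(dfo[c(keep, value)], dfo[groups], tail, 1)
}

# Toy data; values are hypothetical
DF <- data.frame(
  Lake   = c("Clear", "Clear", "Clear", "Roy", "Roy"),
  psd    = c("substock", "substock", "S-Q", "S-Q", "S-Q"),
  Length = c(152, 179, 180, 200, 240),
  vol    = c(3.2655, 1.4490, 0.0210, 5.5, 2.0)
)

res <- max_by_group(DF, groups = c("Lake", "psd"), value = "vol", keep = "Length")
print(res)
```

The result is a data frame with Lake, psd, Length and vol columns, so
the Length and vol values can be handed on to nls.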

On Mon, Dec 22, 2008 at 9:14 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote: