mean for subset

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20100105/13f51d5f/attachment.pl>
Try this:

with(split(DF, with(DF, ave(SCORE, NAME, FUN = length)))[['3']],
tapply(SCORE, NAME[,drop = TRUE], FUN = mean))

Or:

 with(DF, tapply(SCORE, NAME, mean))[table(DF$NAME) == 3]
Hello, does anyone know how to take the mean for a subset of observations?
For example, suppose my data looks like this:

OBS ? ? NAME ? SCORE
1 ? ? ? ? ?Tom ? ? ? 92
2 ? ? ? ? ?Tom ? ? ? 88
3 ? ? ? ? ?Tom ? ? ? 56
4 ? ? ? ? ?James ? ?85
5 ? ? ? ? ?James ? ?75
6 ? ? ? ? ?James ? ?32
7 ? ? ? ? ?Dawn ? ? 56
8 ? ? ? ? ?Dawn ? ? 91
9 ? ? ? ? ?Clara ? ? 95
10 ? ? ? ?Clara ? ? 84

Is there a way to get the mean of the SCORE variable by NAME but only when
the number of observations is equal to 3? ?In other words, is there a way to
get the mean of the SCORE variable for Tom and James, but not for Dawn and
Clara? ?Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Henrique Dallazuanna
Curitiba-Paran?-Brasil
25? 25' 40" S 49? 16' 22" O
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html
Hello, does anyone know how to take the mean for a subset of observations?
For example, suppose my data looks like this:

OBS ? ? NAME ? SCORE
1 ? ? ? ? ?Tom ? ? ? 92
2 ? ? ? ? ?Tom ? ? ? 88
3 ? ? ? ? ?Tom ? ? ? 56
4 ? ? ? ? ?James ? ?85
5 ? ? ? ? ?James ? ?75
6 ? ? ? ? ?James ? ?32
7 ? ? ? ? ?Dawn ? ? 56
8 ? ? ? ? ?Dawn ? ? 91
9 ? ? ? ? ?Clara ? ? 95
10 ? ? ? ?Clara ? ? 84

Is there a way to get the mean of the SCORE variable by NAME but only when
the number of observations is equal to 3? ?In other words, is there a way to
get the mean of the SCORE variable for Tom and James, but not for Dawn and
Clara? ?Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hello, does anyone know how to take the mean for a subset of observations?
For example, suppose my data looks like this:

OBS     NAME   SCORE
1          Tom       92
2          Tom       88
3          Tom       56
4          James    85
5          James    75
6          James    32
7          Dawn     56
8          Dawn     91
9          Clara     95
10        Clara     84

Is there a way to get the mean of the SCORE variable by NAME but only when
the number of observations is equal to 3?  In other words, is there a way to
get the mean of the SCORE variable for Tom and James, but not for Dawn and
Clara?  Thank you.
You can use tapply() together with a custom function that returns NA if 
the condition is not satisfied, e.g.

## read data
dat <- read.table(textConnection("
OBS     NAME   SCORE
1          Tom       92
2          Tom       88
3          Tom       56
4          James    85
5          James    75
6          James    32
7          Dawn     56
8          Dawn     91
9          Clara     95
10        Clara     84
"), header = TRUE)

## use tapply() with custom function
with(dat,
   tapply(SCORE, NAME, function(x) if(length(x) == 3) mean(x) else NA)
)

Alternatively you could look at

mymean <-   with(dat, tapply(SCORE, NAME, mean))
mylength <- with(dat, tapply(SCORE, NAME, length))
mymean[mylength == 3]

etc.

hth,
Z
-- 
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hello, does anyone know how to take the mean for a subset of observations?
For example, suppose my data looks like this:

OBS     NAME   SCORE
1          Tom       92
2          Tom       88
3          Tom       56
4          James    85
5          James    75
6          James    32
7          Dawn     56
8          Dawn     91
9          Clara     95
10        Clara     84

Is there a way to get the mean of the SCORE variable by NAME but only when
the number of observations is equal to 3?  In other words, is there a way to
get the mean of the SCORE variable for Tom and James, but not for Dawn and
Clara?  Thank you.

You probably want to do it in two steps:  first, find which names have 3 
observations, and take that subset of the dataset; then do the mean on 
all groups.  This is one way:

 > counts <- table(dataset$NAME)
 > keep <- names(counts)[counts == 3]
 > subset <- dataset[ dataset$NAME %in% keep,]
 > tapply(subset$SCORE, subset$NAME, mean)
   Clara     Dawn    James      Tom
      NA       NA 64.00000 78.66667

Duncan Murdoch
Here is the solution using sqldf which can do it in one statement:
# read in data
Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
DF <- read.table(textConnection(Lines), header = TRUE)

# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html

On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of observations?
For example, suppose my data looks like this:

OBS ? ? NAME ? SCORE
1 ? ? ? ? ?Tom ? ? ? 92
2 ? ? ? ? ?Tom ? ? ? 88
3 ? ? ? ? ?Tom ? ? ? 56
4 ? ? ? ? ?James ? ?85
5 ? ? ? ? ?James ? ?75
6 ? ? ? ? ?James ? ?32
7 ? ? ? ? ?Dawn ? ? 56
8 ? ? ? ? ?Dawn ? ? 91
9 ? ? ? ? ?Clara ? ? 95
10 ? ? ? ?Clara ? ? 84

Is there a way to get the mean of the SCORE variable by NAME but only when
the number of observations is equal to 3? ?In other words, is there a way to
get the mean of the SCORE variable for Tom and James, but not for Dawn and
Clara? ?Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

As can data.table (i.e. do 'having' in one statement) :
DT = data.table(DF)
DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
NAME n       V2
[1,] James 3 64.00000
[2,]   Tom 3 78.66667

but data.table isn't restricted to SQL functions (such as avg),  any R 
functions can be used, sometimes for their side effects (such as plotting) 
rather than just returning data.

Further data.table has a thing called 'join inherited scoping'.   Say we 
knew the specific groups,  we can go directly to them (without even looking 
at the rest of the data in the table) in very short and convenient syntax, 
which also happens to run quickly on large data sets (but can be useful just 
for the syntax alone) :
setkey(DT,NAME)
DT[c("James","Tom"),mean(SCORE),mult="all"]
NAME       V1
[1,] James 64.00000
[2,]   Tom 78.66667

Notice there is no "group by" or even a "by" in the above.  It inherits the 
scope from the join because mult="all" means that "James" matches to 
multiple rows, as does "Tom", creating two groups.  It does it by binary 
search to the beginning of each group,  binary search to the end of the 
group,  and runs the R expression inside the scope of that group.

An example of join inherited scoping for the side effects only :
pdf("out.pdf")
DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
dev.off()
# out.pdf now contains 2 plots

which you couldn't do in SQL because SQL has no plotting (or any of R's 
other packages).

It aims to do this quickly.  Where 'quickly' means 1) shorter code is 
quicker to write, read, debug and maintain  and also  2) quicker to compute, 
and its 1 that often dominates 2.

Finally, consider the following two statements which are both equivalent :
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667
DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly, 
but bear with me ... in the SQL solution do you know for sure that 
avg(SCORE) isn't computed wastefully for the all the groups that don't have 
count(*)=3 ?  It might well do the 'group by' first for all the groups, then 
do the 'having' afterwards as a 'where' on the result.  It might depend on 
the particular SQL database being used (mySQL, sqllite, etc) or the 
installation parameters, any indexes etc.  Some investigation would be 
required (taking time) if someone doesn't already know.  In the data.table 
however,  the syntax explictly makes it clear than mean(SCORE) is only 
computed for the particular groups.  For certain, always.   Maybe this 
particular example is not a good one, but I'm trying to demonstrate an 
overall syntax which is scalable (i.e. this syntax can do more complicated 
things that SQL can't, or can't do well).   Notice that the method earlier 
on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is simpler 
but wasteful as it does compute mean(SCORE) for all the groups.  But the 
syntax explicity conveys what is being done, and the user has the choice.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
Here is the solution using sqldf which can do it in one statement:
# read in data
Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
DF <- read.table(textConnection(Lines), header = TRUE)

# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) = 3")
NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html

On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of 
observations?
For example, suppose my data looks like this:

OBS NAME SCORE
1 Tom 92
2 Tom 88
3 Tom 56
4 James 85
5 James 75
6 James 32
7 Dawn 56
8 Dawn 91
9 Clara 95
10 Clara 84

Is there a way to get the mean of the SCORE variable by NAME but only 
when
the number of observations is equal to 3? In other words, is there a way 
to
get the mean of the SCORE variable for Tom and James, but not for Dawn 
and
Clara? Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

As a novice R user, I face a similar challenge. I am almost afraid to share
with this group how I solved it. About 65 labs in our proficiency program
submit data on individual Excel spreadsheets with triple replicates. There
always are a few labs that do not complete the full set of three replicates,
and I do not want their data included in my analysis.

First, I combine all the individual spreadsheets into one large Excel
spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
sort on each individual rep column in Excel. Then I go to both the top and
the bottom of the list. 

For example, I sort on rep1 and go to the top of the list to delete any rows
where a value for rep1 was not recorded. Then I go to the bottom of the list
and delete any rows where rep1 is text instead of a number, for example,
<0.001. I should say that the labs are instructed that they must complete
all three replicates, and they must not enter results as text. Next I repeat
the process for rep2 and rep3. 

I'll do a little more work in Excel on the large, combined table with all
the lab data. I calculate in Excel the mean, standard deviation, and
coefficient of variation for each of the three reps. Finally, I filter all
the data and delete duplicate rows. This is necessary as I sometimes
accidentally copy the same spreadsheet two times from a lab into my large
table. Finally, I save the cleaned up table in *.csv format that is easily
read into R. 

I know that R can do all of these things, but if you are just learning how
to use R it might be easier to do some initial work in Excel, or a similar
spreadsheet, before running your data through R.

I also use MS-Word's mail merge feature to generate my code. I'll get three
or four pages of code doing what I want for a single analytical test, for
example, calcium. Then I'll use the mail merge feature to generate hundreds
of pages of code with the other analytical tests (nitrogen, phosphorus,
potassium, etc.). I just copy and paste the large, merged Word document into
R. R cranks away for 30 minutes and I end up with several large tables (and
these get additional editing in Ecel) and hundreds of beautiful graphs that
would take weeks to create in Excel.

I was amazed that Word would work. I expected all of Word's special print
control codes would mess things up. I just recently received a new laptop
computer, and now I have an occassional problem with Word's "pretty print
quotes," but if you know about that problem, it is easy to fix.

Jerry Floren
Minnesota Department of Agriculture

As can data.table (i.e. do 'having' in one statement) :

DT = data.table(DF)
DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
      NAME n       V2
[1,] James 3 64.00000
[2,]   Tom 3 78.66667

but data.table isn't restricted to SQL functions (such as avg),  any R 
functions can be used, sometimes for their side effects (such as plotting) 
rather than just returning data.

Further data.table has a thing called 'join inherited scoping'.   Say we 
knew the specific groups,  we can go directly to them (without even
looking 
at the rest of the data in the table) in very short and convenient syntax, 
which also happens to run quickly on large data sets (but can be useful
just 
for the syntax alone) :

setkey(DT,NAME)
DT[c("James","Tom"),mean(SCORE),mult="all"]
      NAME       V1
[1,] James 64.00000
[2,]   Tom 78.66667

Notice there is no "group by" or even a "by" in the above.  It inherits
the 
scope from the join because mult="all" means that "James" matches to 
multiple rows, as does "Tom", creating two groups.  It does it by binary 
search to the beginning of each group,  binary search to the end of the 
group,  and runs the R expression inside the scope of that group.

An example of join inherited scoping for the side effects only :

pdf("out.pdf")
DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
dev.off()
# out.pdf now contains 2 plots

which you couldn't do in SQL because SQL has no plotting (or any of R's 
other packages).

It aims to do this quickly.  Where 'quickly' means 1) shorter code is 
quicker to write, read, debug and maintain  and also  2) quicker to
compute, 
and its 1 that often dominates 2.

Finally, consider the following two statements which are both equivalent :

sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667
DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly, 
but bear with me ... in the SQL solution do you know for sure that 
avg(SCORE) isn't computed wastefully for the all the groups that don't
have 
count(*)=3 ?  It might well do the 'group by' first for all the groups,
then 
do the 'having' afterwards as a 'where' on the result.  It might depend on 
the particular SQL database being used (mySQL, sqllite, etc) or the 
installation parameters, any indexes etc.  Some investigation would be 
required (taking time) if someone doesn't already know.  In the data.table 
however,  the syntax explictly makes it clear than mean(SCORE) is only 
computed for the particular groups.  For certain, always.   Maybe this 
particular example is not a good one, but I'm trying to demonstrate an 
overall syntax which is scalable (i.e. this syntax can do more complicated 
things that SQL can't, or can't do well).   Notice that the method earlier 
on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is
simpler 
but wasteful as it does compute mean(SCORE) for all the groups.  But the 
syntax explicity conveys what is being done, and the user has the choice.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
Here is the solution using sqldf which can do it in one statement:

# read in data
Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
DF <- read.table(textConnection(Lines), header = TRUE)

# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html

On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of 
observations?
For example, suppose my data looks like this:

OBS NAME SCORE
1 Tom 92
2 Tom 88
3 Tom 56
4 James 85
5 James 75
6 James 32
7 Dawn 56
8 Dawn 91
9 Clara 95
10 Clara 84

Is there a way to get the mean of the SCORE variable by NAME but only 
when
the number of observations is equal to 3? In other words, is there a way 
to
get the mean of the SCORE variable for Tom and James, but not for Dawn 
and
Clara? Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

View this message in context: http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
Sent from the R help mailing list archive at Nabble.com.
Did you really say you're using Word's mail merge to construct "hundreds" of 
pages of R code which you then paste in to R ? It sounds like you just 
missed somehow how to create a function in R.  Did you fully read the book 
Introduction to R ?  Did you know R can read xls directly, and connect to 
spreadsheets as if they were databases, see ?odbcConnectExcel.

Your graphs may exist and be beautiful but are they correct ?   This link 
contains a formal discussion of the topic :
http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html

"Jerry Floren" <jerry.floren at state.mn.us> wrote in message 
news:1262877373634-1008892.post at n4.nabble.com...
As a novice R user, I face a similar challenge. I am almost afraid to 
share
with this group how I solved it. About 65 labs in our proficiency program
submit data on individual Excel spreadsheets with triple replicates. There
always are a few labs that do not complete the full set of three 
replicates,
and I do not want their data included in my analysis.

First, I combine all the individual spreadsheets into one large Excel
spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
sort on each individual rep column in Excel. Then I go to both the top and
the bottom of the list.

For example, I sort on rep1 and go to the top of the list to delete any 
rows
where a value for rep1 was not recorded. Then I go to the bottom of the 
list
and delete any rows where rep1 is text instead of a number, for example,
<0.001. I should say that the labs are instructed that they must complete
all three replicates, and they must not enter results as text. Next I 
repeat
the process for rep2 and rep3.

I'll do a little more work in Excel on the large, combined table with all
the lab data. I calculate in Excel the mean, standard deviation, and
coefficient of variation for each of the three reps. Finally, I filter all
the data and delete duplicate rows. This is necessary as I sometimes
accidentally copy the same spreadsheet two times from a lab into my large
table. Finally, I save the cleaned up table in *.csv format that is easily
read into R.

I know that R can do all of these things, but if you are just learning how
to use R it might be easier to do some initial work in Excel, or a similar
spreadsheet, before running your data through R.

I also use MS-Word's mail merge feature to generate my code. I'll get 
three
or four pages of code doing what I want for a single analytical test, for
example, calcium. Then I'll use the mail merge feature to generate 
hundreds
of pages of code with the other analytical tests (nitrogen, phosphorus,
potassium, etc.). I just copy and paste the large, merged Word document 
into
R. R cranks away for 30 minutes and I end up with several large tables 
(and
these get additional editing in Ecel) and hundreds of beautiful graphs 
that
would take weeks to create in Excel.

I was amazed that Word would work. I expected all of Word's special print
control codes would mess things up. I just recently received a new laptop
computer, and now I have an occassional problem with Word's "pretty print
quotes," but if you know about that problem, it is easy to fix.

Jerry Floren
Minnesota Department of Agriculture

Matthew Dowle-3 wrote:

As can data.table (i.e. do 'having' in one statement) :

DT = data.table(DF)
DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
      NAME n       V2
[1,] James 3 64.00000
[2,]   Tom 3 78.66667

but data.table isn't restricted to SQL functions (such as avg),  any R
functions can be used, sometimes for their side effects (such as 
plotting)
rather than just returning data.

Further data.table has a thing called 'join inherited scoping'.   Say we
knew the specific groups,  we can go directly to them (without even
looking
at the rest of the data in the table) in very short and convenient 
syntax,
which also happens to run quickly on large data sets (but can be useful
just
for the syntax alone) :

setkey(DT,NAME)
DT[c("James","Tom"),mean(SCORE),mult="all"]
      NAME       V1
[1,] James 64.00000
[2,]   Tom 78.66667

Notice there is no "group by" or even a "by" in the above.  It inherits
the
scope from the join because mult="all" means that "James" matches to
multiple rows, as does "Tom", creating two groups.  It does it by binary
search to the beginning of each group,  binary search to the end of the
group,  and runs the R expression inside the scope of that group.

An example of join inherited scoping for the side effects only :

pdf("out.pdf")
DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
dev.off()
# out.pdf now contains 2 plots

which you couldn't do in SQL because SQL has no plotting (or any of R's
other packages).

It aims to do this quickly.  Where 'quickly' means 1) shorter code is
quicker to write, read, debug and maintain  and also  2) quicker to
compute,
and its 1 that often dominates 2.

Finally, consider the following two statements which are both equivalent 
:

sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667
DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly,
but bear with me ... in the SQL solution do you know for sure that
avg(SCORE) isn't computed wastefully for the all the groups that don't
have
count(*)=3 ?  It might well do the 'group by' first for all the groups,
then
do the 'having' afterwards as a 'where' on the result.  It might depend 
on
the particular SQL database being used (mySQL, sqllite, etc) or the
installation parameters, any indexes etc.  Some investigation would be
required (taking time) if someone doesn't already know.  In the 
data.table
however,  the syntax explictly makes it clear than mean(SCORE) is only
computed for the particular groups.  For certain, always.   Maybe this
particular example is not a good one, but I'm trying to demonstrate an
overall syntax which is scalable (i.e. this syntax can do more 
complicated
things that SQL can't, or can't do well).   Notice that the method 
earlier
on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is
simpler
but wasteful as it does compute mean(SCORE) for all the groups.  But the
syntax explicity conveys what is being done, and the user has the choice.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
Here is the solution using sqldf which can do it in one statement:

# read in data
Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
DF <- read.table(textConnection(Lines), header = TRUE)

# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html

On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of
observations?
For example, suppose my data looks like this:

OBS NAME SCORE
1 Tom 92
2 Tom 88
3 Tom 56
4 James 85
5 James 75
6 James 32
7 Dawn 56
8 Dawn 91
9 Clara 95
10 Clara 84

Is there a way to get the mean of the SCORE variable by NAME but only
when
the number of observations is equal to 3? In other words, is there a 
way
to
get the mean of the SCORE variable for Tom and James, but not for Dawn
and
Clara? Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

-- 
View this message in context: 
http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
Sent from the R help mailing list archive at Nabble.com.

Thanks for your insights Matthew. Actually, some of the merged documents are
over 1,000 pages. I have never had a programming class, and I had one
college statistics course in 1975. I might be in over my head, but R along
with Word's mail merge has allowed me to put together some pretty useful
reports.

Thanks for your suggestions; I am sure they will help improve my R skill
set.

Jerry Floren
Minnesota Department of Agriculture

Did you really say you're using Word's mail merge to construct "hundreds"
of 
pages of R code which you then paste in to R ? It sounds like you just 
missed somehow how to create a function in R.  Did you fully read the book 
Introduction to R ?  Did you know R can read xls directly, and connect to 
spreadsheets as if they were databases, see ?odbcConnectExcel.

Your graphs may exist and be beautiful but are they correct ?   This link 
contains a formal discussion of the topic :
http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html

"Jerry Floren" <jerry.floren at state.mn.us> wrote in message 
news:1262877373634-1008892.post at n4.nabble.com...
As a novice R user, I face a similar challenge. I am almost afraid to 
share
with this group how I solved it. About 65 labs in our proficiency program
submit data on individual Excel spreadsheets with triple replicates.
There
always are a few labs that do not complete the full set of three 
replicates,
and I do not want their data included in my analysis.

First, I combine all the individual spreadsheets into one large Excel
spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I
sort on each individual rep column in Excel. Then I go to both the top
and
the bottom of the list.

For example, I sort on rep1 and go to the top of the list to delete any 
rows
where a value for rep1 was not recorded. Then I go to the bottom of the 
list
and delete any rows where rep1 is text instead of a number, for example,
<0.001. I should say that the labs are instructed that they must complete
all three replicates, and they must not enter results as text. Next I 
repeat
the process for rep2 and rep3.

I'll do a little more work in Excel on the large, combined table with all
the lab data. I calculate in Excel the mean, standard deviation, and
coefficient of variation for each of the three reps. Finally, I filter
all
the data and delete duplicate rows. This is necessary as I sometimes
accidentally copy the same spreadsheet two times from a lab into my large
table. Finally, I save the cleaned up table in *.csv format that is
easily
read into R.

I know that R can do all of these things, but if you are just learning
how
to use R it might be easier to do some initial work in Excel, or a
similar
spreadsheet, before running your data through R.

I also use MS-Word's mail merge feature to generate my code. I'll get 
three
or four pages of code doing what I want for a single analytical test, for
example, calcium. Then I'll use the mail merge feature to generate 
hundreds
of pages of code with the other analytical tests (nitrogen, phosphorus,
potassium, etc.). I just copy and paste the large, merged Word document 
into
R. R cranks away for 30 minutes and I end up with several large tables 
(and
these get additional editing in Ecel) and hundreds of beautiful graphs 
that
would take weeks to create in Excel.

I was amazed that Word would work. I expected all of Word's special print
control codes would mess things up. I just recently received a new laptop
computer, and now I have an occassional problem with Word's "pretty print
quotes," but if you know about that problem, it is easy to fix.

Jerry Floren
Minnesota Department of Agriculture

Matthew Dowle-3 wrote:

As can data.table (i.e. do 'having' in one statement) :

DT = data.table(DF)
DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
      NAME n       V2
[1,] James 3 64.00000
[2,]   Tom 3 78.66667

but data.table isn't restricted to SQL functions (such as avg),  any R
functions can be used, sometimes for their side effects (such as 
plotting)
rather than just returning data.

Further data.table has a thing called 'join inherited scoping'.   Say we
knew the specific groups,  we can go directly to them (without even
looking
at the rest of the data in the table) in very short and convenient 
syntax,
which also happens to run quickly on large data sets (but can be useful
just
for the syntax alone) :

setkey(DT,NAME)
DT[c("James","Tom"),mean(SCORE),mult="all"]
      NAME       V1
[1,] James 64.00000
[2,]   Tom 78.66667

Notice there is no "group by" or even a "by" in the above.  It inherits
the
scope from the join because mult="all" means that "James" matches to
multiple rows, as does "Tom", creating two groups.  It does it by binary
search to the beginning of each group,  binary search to the end of the
group,  and runs the R expression inside the scope of that group.

An example of join inherited scoping for the side effects only :

pdf("out.pdf")
DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
dev.off()
# out.pdf now contains 2 plots

which you couldn't do in SQL because SQL has no plotting (or any of R's
other packages).

It aims to do this quickly.  Where 'quickly' means 1) shorter code is
quicker to write, read, debug and maintain  and also  2) quicker to
compute,
and its 1 that often dominates 2.

Finally, consider the following two statements which are both equivalent 
:

sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667
DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE),
mult="all"]
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

Now ok I hear you groaning (!) that the 2nd looks (on first glance)
ugly,
but bear with me ... in the SQL solution do you know for sure that
avg(SCORE) isn't computed wastefully for the all the groups that don't
have
count(*)=3 ?  It might well do the 'group by' first for all the groups,
then
do the 'having' afterwards as a 'where' on the result.  It might depend 
on
the particular SQL database being used (mySQL, sqllite, etc) or the
installation parameters, any indexes etc.  Some investigation would be
required (taking time) if someone doesn't already know.  In the 
data.table
however,  the syntax explictly makes it clear than mean(SCORE) is only
computed for the particular groups.  For certain, always.   Maybe this
particular example is not a good one, but I'm trying to demonstrate an
overall syntax which is scalable (i.e. this syntax can do more 
complicated
things that SQL can't, or can't do well).   Notice that the method 
earlier
on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]"  is
simpler
but wasteful as it does compute mean(SCORE) for all the groups.  But the
syntax explicity conveys what is being done, and the user has the
choice.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com...
Here is the solution using sqldf which can do it in one statement:

# read in data
Lines <- "OBS     NAME   SCORE
+ 1          Tom       92
+ 2          Tom       88
+ 3          Tom       56
+ 4          James    85
+ 5          James    75
+ 6          James    32
+ 7          Dawn     56
+ 8          Dawn     91
+ 9          Clara     95
+ 10        Clara     84"
DF <- read.table(textConnection(Lines), header = TRUE)

# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
   NAME avg(SCORE)
1 James   64.00000
2   Tom   78.66667

On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Have a look at this post and the rest of that thread:

https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html

On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of
observations?
For example, suppose my data looks like this:

OBS NAME SCORE
1 Tom 92
2 Tom 88
3 Tom 56
4 James 85
5 James 75
6 James 32
7 Dawn 56
8 Dawn 91
9 Clara 95
10 Clara 84

Is there a way to get the mean of the SCORE variable by NAME but only
when
the number of observations is equal to 3? In other words, is there a 
way
to
get the mean of the SCORE variable for Tom and James, but not for Dawn
and
Clara? Thank you.

--
Geoffrey Smith
Visiting Assistant Professor
Department of Finance
W. P. Carey School of Business
Arizona State University

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

-- 
View this message in context: 
http://n4.nabble.com/mean-for-subset-tp999254p1008892.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

View this message in context: http://n4.nabble.com/mean-for-subset-tp999254p1009194.html
Sent from the R help mailing list archive at Nabble.com.