mean for subset
As a novice R user, I face a similar challenge. I am almost afraid to share with this group how I solved it. About 65 labs in our proficiency program submit data on individual Excel spreadsheets with triple replicates. There always are a few labs that do not complete the full set of three replicates, and I do not want their data included in my analysis. First, I combine all the individual spreadsheets into one large Excel spreadsheet. The replicates are in three columns: rep1, rep2, and rep3. I sort on each individual rep column in Excel. Then I go to both the top and the bottom of the list. For example, I sort on rep1 and go to the top of the list to delete any rows where a value for rep1 was not recorded. Then I go to the bottom of the list and delete any rows where rep1 is text instead of a number, for example, <0.001. I should say that the labs are instructed that they must complete all three replicates, and they must not enter results as text. Next I repeat the process for rep2 and rep3. I'll do a little more work in Excel on the large, combined table with all the lab data. I calculate in Excel the mean, standard deviation, and coefficient of variation for each of the three reps. Finally, I filter all the data and delete duplicate rows. This is necessary as I sometimes accidentally copy the same spreadsheet two times from a lab into my large table. Finally, I save the cleaned up table in *.csv format that is easily read into R. I know that R can do all of these things, but if you are just learning how to use R it might be easier to do some initial work in Excel, or a similar spreadsheet, before running your data through R. I also use MS-Word's mail merge feature to generate my code. I'll get three or four pages of code doing what I want for a single analytical test, for example, calcium. Then I'll use the mail merge feature to generate hundreds of pages of code with the other analytical tests (nitrogen, phosphorus, potassium, etc.). I just copy and paste the large, merged Word document into R. R cranks away for 30 minutes and I end up with several large tables (and these get additional editing in Ecel) and hundreds of beautiful graphs that would take weeks to create in Excel. I was amazed that Word would work. I expected all of Word's special print control codes would mess things up. I just recently received a new laptop computer, and now I have an occassional problem with Word's "pretty print quotes," but if you know about that problem, it is easy to fix. Jerry Floren Minnesota Department of Agriculture
Matthew Dowle-3 wrote:
As can data.table (i.e. do 'having' in one statement) :
DT = data.table(DF) DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]
NAME n V2 [1,] James 3 64.00000 [2,] Tom 3 78.66667
but data.table isn't restricted to SQL functions (such as avg), any R functions can be used, sometimes for their side effects (such as plotting) rather than just returning data. Further data.table has a thing called 'join inherited scoping'. Say we knew the specific groups, we can go directly to them (without even looking at the rest of the data in the table) in very short and convenient syntax, which also happens to run quickly on large data sets (but can be useful just for the syntax alone) :
setkey(DT,NAME)
DT[c("James","Tom"),mean(SCORE),mult="all"]
NAME V1 [1,] James 64.00000 [2,] Tom 78.66667
Notice there is no "group by" or even a "by" in the above. It inherits the scope from the join because mult="all" means that "James" matches to multiple rows, as does "Tom", creating two groups. It does it by binary search to the beginning of each group, binary search to the end of the group, and runs the R expression inside the scope of that group. An example of join inherited scoping for the side effects only :
pdf("out.pdf")
DT[c("James","Tom"),plot(SCORE),mult="all"]
NULL data table
dev.off()
# out.pdf now contains 2 plots which you couldn't do in SQL because SQL has no plotting (or any of R's other packages). It aims to do this quickly. Where 'quickly' means 1) shorter code is quicker to write, read, debug and maintain and also 2) quicker to compute, and its 1 that often dominates 2. Finally, consider the following two statements which are both equivalent :
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
NAME avg(SCORE) 1 James 64.00000 2 Tom 78.66667
DT[ J(DT[,length(NAME),by="NAME"][V1==3,NAME]), mean(SCORE), mult="all"]
NAME avg(SCORE) 1 James 64.00000 2 Tom 78.66667 Now ok I hear you groaning (!) that the 2nd looks (on first glance) ugly, but bear with me ... in the SQL solution do you know for sure that avg(SCORE) isn't computed wastefully for the all the groups that don't have count(*)=3 ? It might well do the 'group by' first for all the groups, then do the 'having' afterwards as a 'where' on the result. It might depend on the particular SQL database being used (mySQL, sqllite, etc) or the installation parameters, any indexes etc. Some investigation would be required (taking time) if someone doesn't already know. In the data.table however, the syntax explictly makes it clear than mean(SCORE) is only computed for the particular groups. For certain, always. Maybe this particular example is not a good one, but I'm trying to demonstrate an overall syntax which is scalable (i.e. this syntax can do more complicated things that SQL can't, or can't do well). Notice that the method earlier on i.e. "DT[,list(n=length(NAME),mean(SCORE)),by="NAME"][n==3]" is simpler but wasteful as it does compute mean(SCORE) for all the groups. But the syntax explicity conveys what is being done, and the user has the choice. "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message news:971536df1001051122l58389037p4e16288aedfdeb07 at mail.gmail.com... Here is the solution using sqldf which can do it in one statement:
# read in data Lines <- "OBS NAME SCORE
+ 1 Tom 92 + 2 Tom 88 + 3 Tom 56 + 4 James 85 + 5 James 75 + 6 James 32 + 7 Dawn 56 + 8 Dawn 91 + 9 Clara 95 + 10 Clara 84"
DF <- read.table(textConnection(Lines), header = TRUE)
# run
library(sqldf)
sqldf("select NAME, avg(SCORE) from DF group by NAME having count(*) =
3")
NAME avg(SCORE) 1 James 64.00000 2 Tom 78.66667 On Tue, Jan 5, 2010 at 2:03 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Have a look at this post and the rest of that thread: https://stat.ethz.ch/pipermail/r-help/2010-January/223420.html On Tue, Jan 5, 2010 at 1:29 PM, Geoffrey Smith <gps at asu.edu> wrote:
Hello, does anyone know how to take the mean for a subset of observations? For example, suppose my data looks like this: OBS NAME SCORE 1 Tom 92 2 Tom 88 3 Tom 56 4 James 85 5 James 75 6 James 32 7 Dawn 56 8 Dawn 91 9 Clara 95 10 Clara 84 Is there a way to get the mean of the SCORE variable by NAME but only when the number of observations is equal to 3? In other words, is there a way to get the mean of the SCORE variable for Tom and James, but not for Dawn and Clara? Thank you. -- Geoffrey Smith Visiting Assistant Professor Department of Finance W. P. Carey School of Business Arizona State University [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
View this message in context: http://n4.nabble.com/mean-for-subset-tp999254p1008892.html Sent from the R help mailing list archive at Nabble.com.