Skip to content

identify duplicate entries in data frame and calculate mean

7 messages · Matthew, Nordlund, Dan (DSHS/RDA), Tom

#
I have a data frame with 10 columns.
In the last column is an alphaneumaric identifier.
For most rows, this alphaneumaric identifier is unique to the file, 
however some of these alphanemeric idenitifiers occur in duplicate, 
triplicate or more. When they do occur more than once they are in 
consecutive rows, so when there is a duplicate or triplicate or 
quadruplicate (let's call them multiplicates), they are in consecutive rows.

In column 7 there is an integer number (may or may not be unique. does 
not matter).

I want to identify each multiple entries (multiplicates) occurring in 
column 10 and then for each multiplicate calculate the mean of the 
integers column 7.

As an example, I will show just two columns:
Length  Identifier
321     A234
350     A234
340     A234
180     B123
198     B225

What I want to do (in the above example) is collapse all the A234's and 
report the mean to get this:
Length  Identifier
337     A234
180     B123
198     B225


Matthew
Tom
#
Using dplyr

$ library(dplyr)
$ x<-data.frame(Length=c(321,350,340,180,198),
                        ID=c(rep('A234',3),'B123','B225') )
$ x %>% group_by(ID) %>% summarise(m=mean(Length))



On Tue, May 24, 2016 at 3:46 PM, Matthew <mccormack at molbio.mgh.harvard.edu>
wrote:

  
  
#
You have several  options.  

1.  You could use the aggregate function.  If your data frame is called DF, you could do something like

with(DF, aggregate(Length, list(Identifier), mean))

2.  You could use the dplyr package like this

library(dplyr)
summarize(group_by(DF, Identifier), mean(Length))


Hope this is helpful,

Dan

Daniel Nordlund, PhD
Research and Data Analysis Division
Services & Enterprise Support Administration
Washington State Department of Social and Health Services
#
Thank you very much, Tom.
This gets me thinking in the right direction.
One thing I should have mentioned that I did not is that the number of 
rows in the data frame will be a little over 40,000 rows.
On 5/24/2016 4:08 PM, Tom Wright wrote:

  
  
Tom
#
Don't see that as being a big problem. If your data grows then dplyr
supports connections to external databases. Alternately if you just want a
mean, most databases can do that directly in SQL.

On Tue, May 24, 2016 at 4:17 PM, Matthew <mccormack at molbio.mgh.harvard.edu>
wrote:

  
  
#
Thanks, Tom.  I was making a mistake looking at your example and that's 
what my problem was.

Cool answer, works great. Thank you very much.

Matthew
On 5/24/2016 4:23 PM, Tom Wright wrote:

  
  
#
Thank you very much, Dan.

These work great. Two more great answers to my question.

Matthew
On 5/24/2016 4:15 PM, Nordlund, Dan (DSHS/RDA) wrote: