I have a data frame with 10 columns. In the last column is an alphaneumaric identifier. For most rows, this alphaneumaric identifier is unique to the file, however some of these alphanemeric idenitifiers occur in duplicate, triplicate or more. When they do occur more than once they are in consecutive rows, so when there is a duplicate or triplicate or quadruplicate (let's call them multiplicates), they are in consecutive rows. In column 7 there is an integer number (may or may not be unique. does not matter). I want to identify each multiple entries (multiplicates) occurring in column 10 and then for each multiplicate calculate the mean of the integers column 7. As an example, I will show just two columns: Length Identifier 321 A234 350 A234 340 A234 180 B123 198 B225 What I want to do (in the above example) is collapse all the A234's and report the mean to get this: Length Identifier 337 A234 180 B123 198 B225 Matthew
identify duplicate entries in data frame and calculate mean
7 messages · Matthew, Nordlund, Dan (DSHS/RDA), Tom
Using dplyr
$ library(dplyr)
$ x<-data.frame(Length=c(321,350,340,180,198),
ID=c(rep('A234',3),'B123','B225') )
$ x %>% group_by(ID) %>% summarise(m=mean(Length))
On Tue, May 24, 2016 at 3:46 PM, Matthew <mccormack at molbio.mgh.harvard.edu>
wrote:
I have a data frame with 10 columns. In the last column is an alphaneumaric identifier. For most rows, this alphaneumaric identifier is unique to the file, however some of these alphanemeric idenitifiers occur in duplicate, triplicate or more. When they do occur more than once they are in consecutive rows, so when there is a duplicate or triplicate or quadruplicate (let's call them multiplicates), they are in consecutive rows. In column 7 there is an integer number (may or may not be unique. does not matter). I want to identify each multiple entries (multiplicates) occurring in column 10 and then for each multiplicate calculate the mean of the integers column 7. As an example, I will show just two columns: Length Identifier 321 A234 350 A234 340 A234 180 B123 198 B225 What I want to do (in the above example) is collapse all the A234's and report the mean to get this: Length Identifier 337 A234 180 B123 198 B225 Matthew
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
You have several options. 1. You could use the aggregate function. If your data frame is called DF, you could do something like with(DF, aggregate(Length, list(Identifier), mean)) 2. You could use the dplyr package like this library(dplyr) summarize(group_by(DF, Identifier), mean(Length)) Hope this is helpful, Dan Daniel Nordlund, PhD Research and Data Analysis Division Services & Enterprise Support Administration Washington State Department of Social and Health Services
-----Original Message----- From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Matthew Sent: Tuesday, May 24, 2016 12:47 PM To: r-help at r-project.org Subject: [R] identify duplicate entries in data frame and calculate mean I have a data frame with 10 columns. In the last column is an alphaneumaric identifier. For most rows, this alphaneumaric identifier is unique to the file, however some of these alphanemeric idenitifiers occur in duplicate, triplicate or more. When they do occur more than once they are in consecutive rows, so when there is a duplicate or triplicate or quadruplicate (let's call them multiplicates), they are in consecutive rows. In column 7 there is an integer number (may or may not be unique. does not matter). I want to identify each multiple entries (multiplicates) occurring in column 10 and then for each multiplicate calculate the mean of the integers column 7. As an example, I will show just two columns: Length Identifier 321 A234 350 A234 340 A234 180 B123 198 B225 What I want to do (in the above example) is collapse all the A234's and report the mean to get this: Length Identifier 337 A234 180 B123 198 B225 Matthew
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Thank you very much, Tom. This gets me thinking in the right direction. One thing I should have mentioned that I did not is that the number of rows in the data frame will be a little over 40,000 rows.
On 5/24/2016 4:08 PM, Tom Wright wrote:
Using dplyr
$ library(dplyr)
$ x<-data.frame(Length=c(321,350,340,180,198),
ID=c(rep('A234',3),'B123','B225') )
$ x %>% group_by(ID) %>% summarise(m=mean(Length))
On Tue, May 24, 2016 at 3:46 PM, Matthew
<mccormack at molbio.mgh.harvard.edu
<mailto:mccormack at molbio.mgh.harvard.edu>> wrote:
I have a data frame with 10 columns.
In the last column is an alphaneumaric identifier.
For most rows, this alphaneumaric identifier is unique to the
file, however some of these alphanemeric idenitifiers occur in
duplicate, triplicate or more. When they do occur more than once
they are in consecutive rows, so when there is a duplicate or
triplicate or quadruplicate (let's call them multiplicates), they
are in consecutive rows.
In column 7 there is an integer number (may or may not be unique.
does not matter).
I want to identify each multiple entries (multiplicates) occurring
in column 10 and then for each multiplicate calculate the mean of
the integers column 7.
As an example, I will show just two columns:
Length Identifier
321 A234
350 A234
340 A234
180 B123
198 B225
What I want to do (in the above example) is collapse all the
A234's and report the mean to get this:
Length Identifier
337 A234
180 B123
198 B225
Matthew
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org> mailing list --
To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Don't see that as being a big problem. If your data grows then dplyr supports connections to external databases. Alternately if you just want a mean, most databases can do that directly in SQL. On Tue, May 24, 2016 at 4:17 PM, Matthew <mccormack at molbio.mgh.harvard.edu> wrote:
Thank you very much, Tom.
This gets me thinking in the right direction.
One thing I should have mentioned that I did not is that the number of
rows in the data frame will be a little over 40,000 rows.
On 5/24/2016 4:08 PM, Tom Wright wrote:
Using dplyr
$ library(dplyr)
$ x<-data.frame(Length=c(321,350,340,180,198),
ID=c(rep('A234',3),'B123','B225') )
$ x %>% group_by(ID) %>% summarise(m=mean(Length))
On Tue, May 24, 2016 at 3:46 PM, Matthew <mccormack at molbio.mgh.harvard.edu
wrote:
I have a data frame with 10 columns. In the last column is an alphaneumaric identifier. For most rows, this alphaneumaric identifier is unique to the file, however some of these alphanemeric idenitifiers occur in duplicate, triplicate or more. When they do occur more than once they are in consecutive rows, so when there is a duplicate or triplicate or quadruplicate (let's call them multiplicates), they are in consecutive rows. In column 7 there is an integer number (may or may not be unique. does not matter). I want to identify each multiple entries (multiplicates) occurring in column 10 and then for each multiplicate calculate the mean of the integers column 7. As an example, I will show just two columns: Length Identifier 321 A234 350 A234 340 A234 180 B123 198 B225 What I want to do (in the above example) is collapse all the A234's and report the mean to get this: Length Identifier 337 A234 180 B123 198 B225 Matthew
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Thanks, Tom. I was making a mistake looking at your example and that's what my problem was. Cool answer, works great. Thank you very much. Matthew
On 5/24/2016 4:23 PM, Tom Wright wrote:
Don't see that as being a big problem. If your data grows then dplyr
supports connections to external databases. Alternately if you just
want a mean, most databases can do that directly in SQL.
On Tue, May 24, 2016 at 4:17 PM, Matthew
<mccormack at molbio.mgh.harvard.edu
<mailto:mccormack at molbio.mgh.harvard.edu>> wrote:
Thank you very much, Tom.
This gets me thinking in the right direction.
One thing I should have mentioned that I did not is that the
number of rows in the data frame will be a little over 40,000 rows.
On 5/24/2016 4:08 PM, Tom Wright wrote:
Using dplyr
$ library(dplyr)
$ x<-data.frame(Length=c(321,350,340,180,198),
ID=c(rep('A234',3),'B123','B225') )
$ x %>% group_by(ID) %>% summarise(m=mean(Length))
On Tue, May 24, 2016 at 3:46 PM, Matthew
<mccormack at molbio.mgh.harvard.edu
<mailto:mccormack at molbio.mgh.harvard.edu>> wrote:
I have a data frame with 10 columns.
In the last column is an alphaneumaric identifier.
For most rows, this alphaneumaric identifier is unique to the
file, however some of these alphanemeric idenitifiers occur
in duplicate, triplicate or more. When they do occur more
than once they are in consecutive rows, so when there is a
duplicate or triplicate or quadruplicate (let's call them
multiplicates), they are in consecutive rows.
In column 7 there is an integer number (may or may not be
unique. does not matter).
I want to identify each multiple entries (multiplicates)
occurring in column 10 and then for each multiplicate
calculate the mean of the integers column 7.
As an example, I will show just two columns:
Length Identifier
321 A234
350 A234
340 A234
180 B123
198 B225
What I want to do (in the above example) is collapse all the
A234's and report the mean to get this:
Length Identifier
337 A234
180 B123
198 B225
Matthew
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org> mailing
list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible
code.
Thank you very much, Dan. These work great. Two more great answers to my question. Matthew
On 5/24/2016 4:15 PM, Nordlund, Dan (DSHS/RDA) wrote:
You have several options. 1. You could use the aggregate function. If your data frame is called DF, you could do something like with(DF, aggregate(Length, list(Identifier), mean)) 2. You could use the dplyr package like this library(dplyr) summarize(group_by(DF, Identifier), mean(Length)) Hope this is helpful, Dan Daniel Nordlund, PhD Research and Data Analysis Division Services & Enterprise Support Administration Washington State Department of Social and Health Services
-----Original Message----- From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Matthew Sent: Tuesday, May 24, 2016 12:47 PM To: r-help at r-project.org Subject: [R] identify duplicate entries in data frame and calculate mean I have a data frame with 10 columns. In the last column is an alphaneumaric identifier. For most rows, this alphaneumaric identifier is unique to the file, however some of these alphanemeric idenitifiers occur in duplicate, triplicate or more. When they do occur more than once they are in consecutive rows, so when there is a duplicate or triplicate or quadruplicate (let's call them multiplicates), they are in consecutive rows. In column 7 there is an integer number (may or may not be unique. does not matter). I want to identify each multiple entries (multiplicates) occurring in column 10 and then for each multiplicate calculate the mean of the integers column 7. As an example, I will show just two columns: Length Identifier 321 A234 350 A234 340 A234 180 B123 198 B225 What I want to do (in the above example) is collapse all the A234's and report the mean to get this: Length Identifier 337 A234 180 B123 198 B225 Matthew
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.