An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121012/5e3cac1a/attachment.pl>
average duplicated rows?
7 messages · Vining, Kelly, Pieter Schoonees, Rui Barradas +2 more
You will have to split() the data and unsplit() it after making the alterations. Have a look at the plyr package for such functions.
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Vining, Kelly
Sent: Friday 12 October 2012 5:42
To: r-help at r-project.org
Subject: [R] average duplicated rows?
Dear useRs,
I have a slightly complicated data structure and am stuck trying to
extract what I need. I'm pasting an example of this data below. In some
cases, there are duplicates in the "gene_id" column because there are
two different "sample 1" values for a given "sample 2" value. Where
these duplicates exist, I need to average the corresponding "FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with a
row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.
Seems like this is a job for some version of the apply function, but
searching and puzzling over this has not gotten me anywhere. Any help
will be much appreciated!
Example data:
gene_id sample_1 sample_2 FL_EARLY FL_LATE
763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605
763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202
1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202
764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275
1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275
1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169
764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770
764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647
764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492
1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492
764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273
1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273
764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870
1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870
764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756
764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320
764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565
764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127
764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918
--Kelly V.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Hello, It could be a job for tapply, but I find it more suited for ?ave. dat <- read.table(text = " gene_id sample_1 sample_2 FL_EARLY FL_LATE 763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605 763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202 1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202 764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275 1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275 1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169 764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770 764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647 764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492 1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492 764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273 1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273 764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870 1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870 764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756 764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320 764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565 764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127 764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918 ", header = TRUE) av <- ave(dat$FL_EARLY, dat$gene_id) dat$FLY_EARLY <- av Hope this helps, Rui Barradas Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs,
I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.
Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!
Example data:
gene_id sample_1 sample_2 FL_EARLY FL_LATE
763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605
763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202
1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202
764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275
1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275
1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169
764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770
764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647
764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492
1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492
764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273
1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273
764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870
1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870
764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756
764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320
764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565
764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127
764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918
--Kelly V.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
HI,
You can also try this:
dat1<-read.table(text="
??????????? gene_id sample_1 sample_2?? FL_EARLY? FL_LATE
763938? Eucgr.A00054?? fl_S1E?? fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?? fl_S1E?? fl_S1L?? 0.367960? 14.1202
1273243 Eucgr.A00101??? fl_S2?? fl_S1L?? 0.356625? 14.1202
764169? Eucgr.A00350?? fl_S1E?? fl_S1L?? 7.381070? 43.9275
1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650??? fl_S2?? fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?? fl_S1E?? fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744??? fl_S2?? fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?? fl_S1E?? fl_S1L?? 2.937760? 14.9647
764683? Eucgr.A00990?? fl_S1E?? fl_S1L?? 8.681250? 48.5492
1273947 Eucgr.A00990??? fl_S2?? fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?? fl_S1E?? fl_S1L?? 0.000000? 57.9273
1273974 Eucgr.A01020??? fl_S2?? fl_S1L?? 0.000000? 57.9273
764756? Eucgr.A01073?? fl_S1E?? fl_S1L?? 8.504710 101.1870
1274020 Eucgr.A01073??? fl_S2?? fl_S1L?? 5.400010 101.1870
764773? Eucgr.A01091?? fl_S1E?? fl_S1L?? 3.448910? 15.7756
764826? Eucgr.A01152?? fl_S1E?? fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?? fl_S1E?? fl_S1L?? 7.265640? 30.9565
764845? Eucgr.A01172?? fl_S1E?? fl_S1L?? 3.248020? 16.9127
764927? Eucgr.A01269?? fl_S1E?? fl_S1L? 18.710200? 76.6918
",sep="",header=TRUE,stringsAsFactors=FALSE)
do.call(rbind,lapply(lapply(split(dat1,dat1$gene_id),`[`,4:5),colMeans))
??????????????? FL_EARLY? FL_LATE
#Eucgr.A00054? 13.1708000? 22.2605
#Eucgr.A00101?? 0.3622925? 14.1202
#Eucgr.A00350?? 9.0277850? 43.9275
#Eucgr.A00650? 33.6691000? 50.0169
#Eucgr.A00744 137.5440000 747.2770
#Eucgr.A00890?? 2.9377600? 14.9647
#Eucgr.A00990?? 9.6172750? 48.5492
#Eucgr.A01020?? 0.0000000? 57.9273
#Eucgr.A01073?? 6.9523600 101.1870
#Eucgr.A01091?? 3.4489100? 15.7756
#Eucgr.A01152? 69.5657000 198.2320
#Eucgr.A01158?? 7.2656400? 30.9565
#Eucgr.A01172?? 3.2480200? 16.9127
#Eucgr.A01269? 18.7102000? 76.6918
In addition to aggregate(), ddply() etc.
library(data.table)
dat2<-data.table(dat1)
dat3<-dat2[,list(FL_EARLY=mean(FL_EARLY),FL_LATE=mean(FL_LATE)),list(gene_id)]
#aggregate()
dat4<-with(dat1,aggregate(cbind(FL_EARLY,FL_LATE),list(gene_id),FUN=mean))
colnames(dat4)<-colnames(dat1)[c(1,4,5)]
#ddply()
library(plyr)
dat5<-ddply(dat1,.(gene_id),colwise(mean,c("FL_EARLY","FL_LATE")))
A.K.
----- Original Message -----
From: "Vining, Kelly" <Kelly.Vining at oregonstate.edu>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc:
Sent: Friday, October 12, 2012 11:41 AM
Subject: [R] average duplicated rows?
Dear useRs,
I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.
Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!
Example data:
? ? ? ? ? ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE
763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202
764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770
764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647
764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273
764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870
764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756
764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565
764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127
764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918
--Kelly V.
??? [[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
HI, My earlier solutions averaged FL_EARLY values for duplicated "gene_ids" so that the resultant dataframe had unique rows.? But, if you want to keep the duplicated rows with average values, you can also try this: dat$FL_EARLY<-unlist(lapply(lapply(split(dat,dat$gene_id),`[`,4),function(x) rep(colMeans(x),each=nrow(x))),use.names=F) ?head(dat) #???????????? gene_id sample_1 sample_2?? FL_EARLY FL_LATE #763938? Eucgr.A00054?? fl_S1E?? fl_S1L 13.1708000 22.2605 #763979? Eucgr.A00101?? fl_S1E?? fl_S1L? 0.3622925 14.1202 #1273243 Eucgr.A00101??? fl_S2?? fl_S1L? 0.3622925 14.1202 #764169? Eucgr.A00350?? fl_S1E?? fl_S1L? 9.0277850 43.9275 #1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 9.0277850 43.9275 #1273669 Eucgr.A00650??? fl_S2?? fl_S1L 33.6691000 50.0169 A.K. ----- Original Message ----- From: Rui Barradas <ruipbarradas at sapo.pt> To: "Vining, Kelly" <Kelly.Vining at oregonstate.edu> Cc: "r-help at r-project.org" <r-help at r-project.org> Sent: Friday, October 12, 2012 1:10 PM Subject: Re: [R] average duplicated rows? Hello, It could be a job for tapply, but I find it more suited for ?ave. dat <- read.table(text = " ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE 763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202 1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202 764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275 1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647 764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492 1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273 1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273 764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870 1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870 764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756 764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565 764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127 764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918 ", header = TRUE) av <- ave(dat$FL_EARLY, dat$gene_id) dat$FLY_EARLY <- av Hope this helps, Rui Barradas Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs, I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates. Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated! Example data: ? ? ? ? ? ? ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE 763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202 1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202 764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275 1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647 764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492 1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273 1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273 764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870 1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870 764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756 764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565 764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127 764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918 --Kelly V. ??? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
2 days later
Hi
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Pieter Schoonees Sent: Friday, October 12, 2012 6:19 PM To: Vining, Kelly; r-help at r-project.org Subject: Re: [R] average duplicated rows? You will have to split() the data and unsplit() it after making the alterations. Have a look at the plyr package for such functions.
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Vining, Kelly Sent: Friday 12 October 2012 5:42 To: r-help at r-project.org Subject: [R] average duplicated rows? Dear useRs, I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because
there
are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding
"FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates. Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!
Aggregate is designed for that
ex.ag<-aggregate(ex[,c("FL_EARLY", "FL_LATE")], list(ex$gene_id), mean)
you will lose sample1 and 2 but you did not mention you want to retain them and how.
Regards
Petr
Example data:
gene_id sample_1 sample_2 FL_EARLY FL_LATE
763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605
763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202
1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202
764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275
1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275
1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169
764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770
764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647
764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492
1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492
764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273
1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273
764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870
1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870
764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756
764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320
764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565
764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127
764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918
--Kelly V.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Thanks much! This was very helpful. --Kelly -----Original Message----- From: arun [mailto:smartpink111 at yahoo.com] Sent: Friday, October 12, 2012 7:23 PM To: Vining, Kelly Cc: Rui Barradas; R help Subject: Re: [R] average duplicated rows? HI, My earlier solutions averaged FL_EARLY values for duplicated "gene_ids" so that the resultant dataframe had unique rows.? But, if you want to keep the duplicated rows with average values, you can also try this: dat$FL_EARLY<-unlist(lapply(lapply(split(dat,dat$gene_id),`[`,4),function(x) rep(colMeans(x),each=nrow(x))),use.names=F) ?head(dat) #???????????? gene_id sample_1 sample_2?? FL_EARLY FL_LATE #763938? Eucgr.A00054?? fl_S1E?? fl_S1L 13.1708000 22.2605 #763979? Eucgr.A00101?? fl_S1E?? fl_S1L? 0.3622925 14.1202 #1273243 Eucgr.A00101??? fl_S2?? fl_S1L? 0.3622925 14.1202 #764169? Eucgr.A00350?? fl_S1E?? fl_S1L? 9.0277850 43.9275 #1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 9.0277850 43.9275 #1273669 Eucgr.A00650??? fl_S2?? fl_S1L 33.6691000 50.0169 A.K. ----- Original Message ----- From: Rui Barradas <ruipbarradas at sapo.pt> To: "Vining, Kelly" <Kelly.Vining at oregonstate.edu> Cc: "r-help at r-project.org" <r-help at r-project.org> Sent: Friday, October 12, 2012 1:10 PM Subject: Re: [R] average duplicated rows? Hello, It could be a job for tapply, but I find it more suited for ?ave. dat <- read.table(text = " ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE 763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202 1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202 764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275 1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647 764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492 1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273 1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273 764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870 1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870 764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756 764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565 764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127 764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918 ", header = TRUE) av <- ave(dat$FL_EARLY, dat$gene_id) dat$FLY_EARLY <- av Hope this helps, Rui Barradas Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs, I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.
Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated! Example data: ? ? ? ? ? ? ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE 763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202 1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202 764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275 1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647 764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492 1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273 1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273 764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870 1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870 764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756 764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565 764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127 764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918 --Kelly V. ??? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.