average duplicated rows?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121012/5e3cac1a/attachment.pl>
You will have to split() the data and unsplit() it after making the alterations. Have a look at the plyr package for such functions.
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Vining, Kelly
Sent: Friday 12 October 2012 5:42
To: r-help at r-project.org
Subject: [R] average duplicated rows?

Dear useRs,

I have a slightly complicated data structure and am stuck trying to
extract what I need. I'm pasting an example of this data below. In some
cases, there are duplicates in the "gene_id" column because there are
two different "sample 1" values for a given "sample 2" value. Where
these duplicates exist, I need to average the corresponding "FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with a
row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.

Seems like this is a job for some version of the apply function, but
searching and puzzling over this has not gotten me anywhere. Any help
will be much appreciated!

Example data:

             gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918

--Kelly V.

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello,

It could be a job for tapply, but I find it more suited for ?ave.

dat <- read.table(text = "
  gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918
", header = TRUE)

av <- ave(dat$FL_EARLY, dat$gene_id)
dat$FLY_EARLY <- av

Hope this helps,

Rui Barradas
Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.

Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!

Example data:

              gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918

--Kelly V.

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
HI,

You can also try this:
dat1<-read.table(text="
??????????? gene_id sample_1 sample_2?? FL_EARLY? FL_LATE
763938? Eucgr.A00054?? fl_S1E?? fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?? fl_S1E?? fl_S1L?? 0.367960? 14.1202
1273243 Eucgr.A00101??? fl_S2?? fl_S1L?? 0.356625? 14.1202
764169? Eucgr.A00350?? fl_S1E?? fl_S1L?? 7.381070? 43.9275
1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650??? fl_S2?? fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?? fl_S1E?? fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744??? fl_S2?? fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?? fl_S1E?? fl_S1L?? 2.937760? 14.9647
764683? Eucgr.A00990?? fl_S1E?? fl_S1L?? 8.681250? 48.5492
1273947 Eucgr.A00990??? fl_S2?? fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?? fl_S1E?? fl_S1L?? 0.000000? 57.9273
1273974 Eucgr.A01020??? fl_S2?? fl_S1L?? 0.000000? 57.9273
764756? Eucgr.A01073?? fl_S1E?? fl_S1L?? 8.504710 101.1870
1274020 Eucgr.A01073??? fl_S2?? fl_S1L?? 5.400010 101.1870
764773? Eucgr.A01091?? fl_S1E?? fl_S1L?? 3.448910? 15.7756
764826? Eucgr.A01152?? fl_S1E?? fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?? fl_S1E?? fl_S1L?? 7.265640? 30.9565
764845? Eucgr.A01172?? fl_S1E?? fl_S1L?? 3.248020? 16.9127
764927? Eucgr.A01269?? fl_S1E?? fl_S1L? 18.710200? 76.6918 
",sep="",header=TRUE,stringsAsFactors=FALSE)

do.call(rbind,lapply(lapply(split(dat1,dat1$gene_id),`[`,4:5),colMeans))
??????????????? FL_EARLY? FL_LATE
#Eucgr.A00054? 13.1708000? 22.2605
#Eucgr.A00101?? 0.3622925? 14.1202
#Eucgr.A00350?? 9.0277850? 43.9275
#Eucgr.A00650? 33.6691000? 50.0169
#Eucgr.A00744 137.5440000 747.2770
#Eucgr.A00890?? 2.9377600? 14.9647
#Eucgr.A00990?? 9.6172750? 48.5492
#Eucgr.A01020?? 0.0000000? 57.9273
#Eucgr.A01073?? 6.9523600 101.1870
#Eucgr.A01091?? 3.4489100? 15.7756
#Eucgr.A01152? 69.5657000 198.2320
#Eucgr.A01158?? 7.2656400? 30.9565
#Eucgr.A01172?? 3.2480200? 16.9127
#Eucgr.A01269? 18.7102000? 76.6918

In addition to aggregate(), ddply() etc.

library(data.table)
dat2<-data.table(dat1)
dat3<-dat2[,list(FL_EARLY=mean(FL_EARLY),FL_LATE=mean(FL_LATE)),list(gene_id)]
#aggregate()
dat4<-with(dat1,aggregate(cbind(FL_EARLY,FL_LATE),list(gene_id),FUN=mean))
colnames(dat4)<-colnames(dat1)[c(1,4,5)]
#ddply()
library(plyr)
dat5<-ddply(dat1,.(gene_id),colwise(mean,c("FL_EARLY","FL_LATE")))

A.K.

----- Original Message -----
From: "Vining, Kelly" <Kelly.Vining at oregonstate.edu>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Friday, October 12, 2012 11:41 AM
Subject: [R] average duplicated rows?

Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.

Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!

Example data:

? ? ? ? ? ?  gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918

--Kelly V.

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
HI,

My earlier solutions averaged FL_EARLY values for duplicated "gene_ids" so that the resultant dataframe had unique rows.? But, if you want to keep the duplicated rows with average values, you can also try this:
dat$FL_EARLY<-unlist(lapply(lapply(split(dat,dat$gene_id),`[`,4),function(x) rep(colMeans(x),each=nrow(x))),use.names=F)
?head(dat)
#???????????? gene_id sample_1 sample_2?? FL_EARLY FL_LATE
#763938? Eucgr.A00054?? fl_S1E?? fl_S1L 13.1708000 22.2605
#763979? Eucgr.A00101?? fl_S1E?? fl_S1L? 0.3622925 14.1202
#1273243 Eucgr.A00101??? fl_S2?? fl_S1L? 0.3622925 14.1202
#764169? Eucgr.A00350?? fl_S1E?? fl_S1L? 9.0277850 43.9275
#1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 9.0277850 43.9275
#1273669 Eucgr.A00650??? fl_S2?? fl_S1L 33.6691000 50.0169
A.K.

----- Original Message -----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: "Vining, Kelly" <Kelly.Vining at oregonstate.edu>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Friday, October 12, 2012 1:10 PM
Subject: Re: [R] average duplicated rows?

Hello,

It could be a job for tapply, but I find it more suited for ?ave.

dat <- read.table(text = "
? gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918
", header = TRUE)

av <- ave(dat$FL_EARLY, dat$gene_id)
dat$FLY_EARLY <- av

Hope this helps,

Rui Barradas
Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.

Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!

Example data:

? ? ? ? ? ? ?  gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918

--Kelly V.

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Pieter Schoonees
Sent: Friday, October 12, 2012 6:19 PM
To: Vining, Kelly; r-help at r-project.org
Subject: Re: [R] average duplicated rows?

You will have to split() the data and unsplit() it after making the
alterations. Have a look at the plyr package for such functions.

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Vining, Kelly
Sent: Friday 12 October 2012 5:42
To: r-help at r-project.org
Subject: [R] average duplicated rows?

Dear useRs,

I have a slightly complicated data structure and am stuck trying to
extract what I need. I'm pasting an example of this data below. In
some cases, there are duplicates in the "gene_id" column because
there
are two different "sample 1" values for a given "sample 2" value.
Where these duplicates exist, I need to average the corresponding
"FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with
a row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.

Seems like this is a job for some version of the apply function, but
searching and puzzling over this has not gotten me anywhere. Any help
will be much appreciated!
Aggregate is designed for that

ex.ag<-aggregate(ex[,c("FL_EARLY", "FL_LATE")], list(ex$gene_id), mean)

you will lose sample1 and 2 but you did not mention you want to retain them and how.

Regards
Petr
Example data:

             gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918

--Kelly V.

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.
Thanks much! This was very helpful.

--Kelly

-----Original Message-----
From: arun [mailto:smartpink111 at yahoo.com] 
Sent: Friday, October 12, 2012 7:23 PM
To: Vining, Kelly
Cc: Rui Barradas; R help
Subject: Re: [R] average duplicated rows?

HI,

My earlier solutions averaged FL_EARLY values for duplicated "gene_ids" so that the resultant dataframe had unique rows.? But, if you want to keep the duplicated rows with average values, you can also try this:
dat$FL_EARLY<-unlist(lapply(lapply(split(dat,dat$gene_id),`[`,4),function(x) rep(colMeans(x),each=nrow(x))),use.names=F)
?head(dat)
#???????????? gene_id sample_1 sample_2?? FL_EARLY FL_LATE
#763938? Eucgr.A00054?? fl_S1E?? fl_S1L 13.1708000 22.2605
#763979? Eucgr.A00101?? fl_S1E?? fl_S1L? 0.3622925 14.1202
#1273243 Eucgr.A00101??? fl_S2?? fl_S1L? 0.3622925 14.1202
#764169? Eucgr.A00350?? fl_S1E?? fl_S1L? 9.0277850 43.9275
#1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 9.0277850 43.9275
#1273669 Eucgr.A00650??? fl_S2?? fl_S1L 33.6691000 50.0169 A.K.

----- Original Message -----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: "Vining, Kelly" <Kelly.Vining at oregonstate.edu>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Friday, October 12, 2012 1:10 PM
Subject: Re: [R] average duplicated rows?

Hello,

It could be a job for tapply, but I find it more suited for ?ave.

dat <- read.table(text = "
? gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918 ", header = TRUE)

av <- ave(dat$FL_EARLY, dat$gene_id)
dat$FLY_EARLY <- av

Hope this helps,

Rui Barradas
Em 12-10-2012 16:41, Vining, Kelly escreveu:
Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates.
Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated!

Example data:

? ? ? ? ? ? ?  gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918

--Kelly V.

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.