-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
Sent: Friday, November 05, 2010 9:58 AM
To: Changbin Du
Cc: r-help at r-project.org
Subject: Re: [R] how to work with long vectors
The following cover_per_3 uses sorting to solve
the problem more quickly. It still has room
for improvement.
cover_per_3 <- function (data)
{
    n <- length(data)
    o <- rev(order(data))
    sdata <- data[o]
    r <- rle(sdata)$lengths
    output <- numeric(n)
    output[o] <- rep(cumsum(r), r)
    100 * output/n
}
(The ecdf function would probably also do the
job quickly.)
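Along the same lines, here is a sketch (not from the thread, and the name cover_per_rank is made up for illustration) of a rank-based variant: with ties.method = "min", n - rank + 1 equals sum(data >= data[i]), so the whole statistic reduces to one vectorized call.

```r
# Sketch (hypothetical name): rank() with ties.method = "min" gives,
# for each element, 1 + the number of strictly smaller values, so
# n - rank(data) + 1 is sum(data >= data[i]) for every i at once.
cover_per_rank <- function(data) {
  n <- length(data)
  100 * (n - rank(data, ties.method = "min") + 1) / n
}

cover_per_rank(c(4, 8, 8, 13))  # 100 75 75 25
```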
When trying to work on problems like this I find
it most fruitful to work on smaller datasets and
see how the time grows with the size of the data,
instead of seeing how many days it takes on a huge
dataset. E.g., the following compares times for
your original function, Phil Spector's simple cleanup
of your function, and the sort-based approach for
vectors of length 5 and 10 thousand.
z <- sample(5e3, size=1e4, replace=TRUE)
print(system.time(v <- cover_per(z)))
print(system.time(v_2 <- cover_per_2(z)))
print(system.time(v_3 <- cover_per_3(z)))
user system elapsed
38.21 0.00 38.41
user system elapsed
0.86 0.00 0.86
user system elapsed
0 0 0
z <- sample(1e4, size=2e4, replace=TRUE)
print(system.time(v <- cover_per(z)))
print(system.time(v_2 <- cover_per_2(z)))
print(system.time(v_3 <- cover_per_3(z)))
user system elapsed
158.48 0.07 159.31
user system elapsed
3.23 0.00 3.25
user system elapsed
0.02 0.00 0.02
[1] TRUE
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Changbin Du
Sent: Friday, November 05, 2010 9:14 AM
To: Phil Spector
Cc: <r-help at r-project.org>
Subject: Re: [R] how to work with long vectors
HI, Phil,
I used the following code and ran it overnight for 15 hours;
this morning I stopped it. It seems it is still not efficient.
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
names(matt)<-c("id","reads")
cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}
result3<-cover_per_2(cover)
On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
<changbind at gmail.com> wrote:
Thanks Phil, that is great! I WILL try this and let you know.
On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
<spector at stat.berkeley.edu>wrote:
Changbin -
Does
100 * sapply(matt$reads,function(x)sum(matt$reads >=
x))/length(matt$reads)
give what you want?
By the way, if you want to use a loop (there's nothing wrong with
that),
then try to avoid the most common mistake that people make in
R:
having your result grow inside the loop. Here's a better loop
to solve your problem:
cover_per_1 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = 100 * sum(ifelse(data >= data[i], 1, 0))/length(data)
  output
}
Using some random data, and comparing to your original function:
system.time(one <- cover_per(dat))
user system elapsed
0.816 0.000 0.824
system.time(two <- cover_per_1(dat))
user system elapsed
0.792 0.000 0.805
Not that big a speedup, but the gap increases quite a bit as the data
gets larger.
There are two obvious ways to speed up your function:
1) Eliminate the ifelse function, since the automatic conversion of
logical to numeric does the same thing.
2) Multiply by 100 and divide by the length outside the loop:
cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}
system.time(three <- cover_per_2(dat))
user system elapsed
0.024 0.000 0.027
That makes the loop just about equivalent to the sapply solution:
system.time(four <- 100*sapply(dat,function(x)sum(dat >= x))/length(dat))
user system elapsed
0.024 0.000 0.026
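For completeness, a self-contained check (with made-up random data, not the timed data above) that the cleaned-up loop and the sapply one-liner agree:

```r
# cover_per_2 as defined earlier in the thread
cover_per_2 <- function(data){
  l = length(data)
  output = numeric(l)
  for(i in 1:l) output[i] = sum(data >= data[i])
  100 * output / l
}

dat <- sample(100, 1000, replace = TRUE)  # hypothetical test data
four <- 100 * sapply(dat, function(x) sum(dat >= x)) / length(dat)
all.equal(cover_per_2(dat), four)  # TRUE
```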
- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
On Thu, 4 Nov 2010, Changbin Du wrote:
HI, Dear R community,
I have one data set like this. What I want to do is to calculate the
cumulative coverage. The following code works for a small data set (n =
100), but when fed the whole data set, it is still running.
Can someone give some suggestions for long vectors?
id reads
Contig79:1 4
Contig79:2 8
Contig79:3 13
Contig79:4 14
Contig79:5 17
Contig79:6 20
Contig79:7 25
Contig79:8 27
Contig79:9 32
Contig79:10 33
Contig79:11 34
matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
sep="\t", skip=0, header=F,fill=T) #
dim(matt)
[1] 3384766 2
matt_plot<-function(matt, outputfile) {
names(matt)<-c("id","reads")
cover<-matt$reads
#calculate the cumulative coverage.
cover_per <- function (data) {
  output <- numeric(0)
  for (i in data) {
    x <- (100*sum(ifelse(data >= i, 1, 0))/length(data))
    output <- c(output, x)
  }
  return(output)
}
result<-cover_per(cover)
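The main cost in cover_per above is growing output with c() on every iteration, which copies the whole vector each time (quadratic work overall). A minimal sketch of the contrast, with made-up function names:

```r
# grow: appends with c(), copying the result on every iteration (O(n^2))
grow <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i)
  out
}

# prealloc: fills a preallocated vector in place (O(n))
prealloc <- function(n) {
  out <- numeric(n)
  for (i in 1:n) out[i] <- i
  out
}

identical(grow(1000), prealloc(1000))  # TRUE, same result, far fewer copies
```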
Thanks so much!
--
Sincerely,
Changbin