[Bioc-devel] Subject: GRanges performance issue, how to avoid looping?
Hi Jesper,
On 02/22/2012 04:15 AM, Jesper G?din wrote:
Hi everyone!
I want to find all reads that map to a specific position in a GRanges
object.
In my case this object is named "reads" and I have tried using the function
below.
#From a given position and chromosome, find the reads where this position
map
fun_find_reads_from_pos<- function(reads,chr,someposition, verbose=TRUE) {
#get length over the sample list
nr_samples<-length(reads)
for (j in (1:nr_samples)) {
onePerson<- reads[[j]]
chr_onePerson<- onePerson[seqnames(onePerson)==chr]
pre_seq<-
chr_onePerson[(ranges(chr_onePerson))%in%(IRanges(start=someposition,
width=1)),6]
if(!length(pre_seq)==0) {
for (k in (1:length(pre_seq))) {
position<- (someposition-start(pre_seq[k])) +1
#57537220
print("found the given someposition on this position within
the read")
print(position)
}
}
}
} #End of function
To use the function try:
fun_find_reads_from_pos(reads,"chr12",57537220)
We can't run this function because we don't know what 'reads' is. It looks like a list-like object (because you are using [[ on it), where each element could be a GRanges or GappedAlignments object (because you are using seqnames() and ranges() on those elements). Given the subject of your email, I'll assume it's a GRangesList object or a list of GRanges objects.
Now that should work. And everything would be fine if it wasnt for the time issue. The time-thief I guess is that I have to loop over every read in every sample to get to the information I want.
You are using 2 nested loops.
The outer loop is on the samples and I would expect the nb of samples
to be relatively small, so this loop is probably not an issue.
The inner loop is on the genomic ranges that actually overlap with
the specified position. It is probably expensive but you shouldn't
do that loop. Instead you should take advantage of the fact that
arithmetic is vectorized in R. So instead of:
if(!length(pre_seq)==0) {
for (k in (1:length(pre_seq))) {
position <- (someposition-start(pre_seq[k])) +1
print("found the given someposition on this position
within the read")
print(position)
}
}
do something like:
if (length(pre_seq) != 0L) {
position <- someposition - start(pre_seq) + 1L
cat("found the given someposition on those positions within
the read:\n")
print(position)
}
Now it's not clear to me why you want to print this but that's another
story.
Cheers,
H.
Is it possible to use the data structure in another better way to avoid unnecessary looping. Anyone have an idea? This function would be the core function of my future package. So its important that it is effective. Sincerely, Jesper Link to the reads object needed to run the function: http://uppsalanf.se/sites/default/files/reads_object.RData And the function itself (same as above): http://uppsalanf.se/sites/default/files/example-bioc-dev.R
sessionInfo()
R version 2.13.2 (2011-09-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.20.4 GenomicRanges_1.4.8 IRanges_1.10.6
?GRanges sessionInfo()
R version 2.13.2 (2011-09-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.20.4 GenomicRanges_1.4.8 IRanges_1.10.6 loaded via a namespace (and not attached): [1] tools_2.13.2 [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319