[Bioc-devel] Subject: GRanges performance issue, how to avoid looping?

Hi Jesper,
Hi everyone!
I want to find all reads that map to a specific position in a GRanges
object.
In my case this object is named "reads" and I have tried using the function
below.

#From a given position and chromosome, find the reads where this position
map
fun_find_reads_from_pos<- function(reads,chr,someposition, verbose=TRUE) {

     #get length over the sample list
     nr_samples<-length(reads)

     for (j in (1:nr_samples)) {
         onePerson<- reads[[j]]
         chr_onePerson<- onePerson[seqnames(onePerson)==chr]
         pre_seq<-
chr_onePerson[(ranges(chr_onePerson))%in%(IRanges(start=someposition,
width=1)),6]

         if(!length(pre_seq)==0) {
             for (k in (1:length(pre_seq))) {
                 position<- (someposition-start(pre_seq[k])) +1
#57537220
                 print("found the given someposition on this position within
the read")
                 print(position)
             }
         }
     }

} #End of function

To use the function try:
fun_find_reads_from_pos(reads,"chr12",57537220)
We can't run this function because we don't know what 'reads' is.
It looks like a list-like object (because you are using [[ on it),
where each element could be a GRanges or GappedAlignments object
(because you are using seqnames() and ranges() on those elements).
Given the subject of your email, I'll assume it's a GRangesList
object or a list of GRanges objects.
Now that should work.
And everything would be fine if it wasnt for the time issue.
The time-thief I guess is that I have to loop over every read
in every sample to get to the information I want.
You are using 2 nested loops.
The outer loop is on the samples and I would expect the nb of samples
to be relatively small, so this loop is probably not an issue.
The inner loop is on the genomic ranges that actually overlap with
the specified position. It is probably expensive but you shouldn't
do that loop. Instead you should take advantage of the fact that
arithmetic is vectorized in R. So instead of:

         if(!length(pre_seq)==0) {
             for (k in (1:length(pre_seq))) {
                 position <- (someposition-start(pre_seq[k])) +1
                 print("found the given someposition on this position 
within the read")
                 print(position)
             }
         }

do something like:

         if (length(pre_seq) != 0L) {
             position <- someposition - start(pre_seq) + 1L
             cat("found the given someposition on those positions within 
the read:\n")
             print(position)
         }

Now it's not clear to me why you want to print this but that's another
story.

Cheers,
H.
Is it
possible to use the data structure in another better way to
avoid unnecessary looping. Anyone have an idea?

This function would be the core function of my future package.
So its important that it is effective.

Sincerely,
Jesper

Link to the reads object needed to run the function:
http://uppsalanf.se/sites/default/files/reads_object.RData
And the function itself (same as above):
http://uppsalanf.se/sites/default/files/example-bioc-dev.R

sessionInfo()
R version 2.13.2 (2011-09-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.20.4   GenomicRanges_1.4.8 IRanges_1.10.6
?GRanges
sessionInfo()
R version 2.13.2 (2011-09-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.20.4   GenomicRanges_1.4.8 IRanges_1.10.6

loaded via a namespace (and not attached):
[1] tools_2.13.2

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

[Bioc-devel] Subject: GRanges performance issue, how to avoid looping?

Thread (7 messages)