[Bioc-devel] appending 2 GappedAlignments using "c" takes long

Hej Bioc Core!

There was some discussion last year about implementing a BamStreamer (? la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.

".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){

 ## create a stream
 stopifnot(is(bamFile,"BamFile"))

 ## set the yieldSize if it is not set already
 if(is.na(yieldSize(bamFile))){
   yieldSize(bamFile) <- yieldSize
 }

 ## open it
 open(bamFile)

 ## verb
 if(verbose){
   message(paste("Streaming",basename(path(bamFile))))
 }

 ## create the output
 out <- GappedAlignments()

 ## process it
 while(length(chunk <- readBamGappedAlignments(bamFile))){
   if(verbose){
     message(paste("Processed",length(chunk),"reads"))
   }
   out <- c(out,chunk)
 }

 ## close
 close(bamFile)

 ## return
 return(out)
}

In the method above, the first iteration of combining the GappedAlignments:

out <- c(out,chunk) takes:

system.time(append(out,chunk))

  user  system elapsed 
123.704   0.060 124.011 

whereas the second iteration (faked here) takes only (still long):

system.time(append(chunk,chunk))

  user  system elapsed 
 2.708   0.044   2.758

I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:

out <- GappedAlignments(seqlengths=seqlengths(chunk))

Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 
 'names(seqlengths)' incompatible with 'levels(seqnames)'

out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))

Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 
 'strand' must be specified when 'seqnames' is not empty

out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")

Error in validObject(.Object) : 
 invalid class ?GappedAlignments? object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
invalid class ?GappedAlignments? object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'

I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.

My sessionInfo:

R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
[5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
[7] LC_PAPER=C                 LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12   
[4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15      
[7] BiocGenerics_0.7.2   

loaded via a namespace (and not attached):
[1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany