[Bioc-devel] [BioC] readGappedAlignments and param argument - Bioc-devel

Tue, Sep 10, 2013 12:22 AM #

Ok, thanks - I misread the manual, I just thought that was a way to
filter out reads falling over certain regions.
After reading the alignments I use summarizeOverlaps to obtain counts
with the desired options. I decided to use scanBamParam(which=..)
because loading bams and counting overlaps is using a lot of RAM for
big bams and I found online various suggestions to use the
ScanBamParam to load one chr at a time...but as long as I would like
to be general I decided to extract all the seqlevels in the GRanges
object that I'm using and then I would like to iterate over those
extracting the reads that fall on different seqlevels.
I noticed that the counts where different with this approach with
respect to doing all the work together. I guess that I have to find
another way to split my alignments...but I'd really like to avoid
something like:
http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
Reducing my features would not work as long as I need to keep the
original number of the reads...I guess I will have to "flatten" all
the regions falling on a single chr of my Granges in a way that avoid
overlapping reads, that should do the trick (I hope...)...maybe I
could extract the smaller-bigger coords for a given chr and obtain
chr-based Granges from there.

Thank you very much for your help,
E.

Martin Morgan

Tue, Sep 10, 2013 3:25 AM #

On 09/10/2013 12:22 AM, Elena Grassi wrote:

You can use summarizeOverlaps with a 'BamFileList' created by something like

   myFiles = dir("/some/dir", pattern="bam$")
   bfl = BamFileList(myFiles, yieldSize=1000000)
   olaps = summarizeOverlaps(features, bfl)

see the example on the help page

   method?"summarizeOverlaps,GRanges,BamFileList"

(there's helpful tab completion after 'method?"sum<tab>'). If you've also said

   library(parallel)
   options(mc.cores=detectCores())  ## how many cores to use?

then the bam files are processed in parallel (choose yieldSize so that the 
different threads are not competing for memory).

more generally a strategy is to use yieldSize and iterate through the file, like

   bf = BamFile(fl, yieldsSize=1000000)
   open(bf)
   repeat {
       curr = readBamGappedAlignments(bf) ## read up to yieldSize records
       if (length(curr) == 0)
           break ## done
       ## process chunk 'curr'
   }
   close(bf)

this is actually what summarizeOverlaps does when invoked on a BamFile (or 
BamFileList().

one small thing that came out of that thread was that

   as(seqinfo(BamFile("/some/file")), "GRanges")

gives a GRanges of all the sequence lengths.

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Elena Grassi

Tue, Sep 10, 2013 3:40 AM #

I see, thanks. Right now I was looking for a solution "prior" to
summarizeOverlaps due to the structure that I've given to my package -
but I will re-think about it and
check if in this way it works (I only have 20 unittest or so to
rewrite, no worries for me :) ).

Right now I've found a way to define the single chromosome GRanges
that works with "which=..." and I would like to offer flexibility
for people with small RAM...therefore running the whole analysis on
one chr at a time seems a reasonable option all the same.

I saw that but I wanted to avoid having to do that for every bam and
merge the results afterwards - working at the "annotation" level
seemed more sensible to me.

I guess that I should have asked my boss to have 1 month to scavenge
around mans and vignettes before starting to write my first
bioconductor-R package :)
E.

Martin Morgan

Tue, Sep 10, 2013 4:05 AM #

On 09/10/2013 03:40 AM, Elena Grassi wrote:

personally, I think it's better to iterate through the file using yieldSize 
rather than 'by chromosome' (it would handle very large files, for instance, 
where even a single chromosome has too many alignments to fit in memory).

The following function would be my second choice (not well-tested; hopefully a 
version will appear in the next GenomicRanges), which divides seqlengths into a 
GRangesList where each element contains ranges that add to approximately equal 
sized widths. So in a big bam file you could just make more tiles.

Maybe the next stumbling block is collating results across tiles? The basic 
strategy will be 'pre-allocate and fill'...

Martin

library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)


tile <-
     function(seqlengths, n)
{
     if (is.null(names(seqlengths)))
         stop("names(seqlengths) must not be NULL")

     clen <- cumsum(as.numeric(seqlengths))
     breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
     breaks <- sort(unique(c(clen, breaks0)))[-1]
     grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))

     idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
     splt <- split(breaks, names(seqlengths)[idx])
     ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
     starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)

     grg <- GRanges(rep(names(ends), sapply(ends, length)),
                    IRanges(unlist(starts, use.names=FALSE),
                            unlist(ends, use.names=FALSE)),
                    seqlengths=seqlengths)
     reduce(splitAsList(grg, grp))
}

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793