[Bioc-devel] [BioC] readGappedAlignments and param argument
On 09/10/2013 12:22 AM, Elena Grassi wrote:
Ok, thanks - I misread the manual, I just thought that was a way to filter out reads falling over certain regions. After reading the alignments I use summarizeOverlaps to obtain counts with the desired options. I decided to use scanBamParam(which=..)
You can use summarizeOverlaps with a 'BamFileList' created by something like
myFiles = dir("/some/dir", pattern="bam$")
bfl = BamFileList(myFiles, yieldSize=1000000)
olaps = summarizeOverlaps(features, bfl)
see the example on the help page
method?"summarizeOverlaps,GRanges,BamFileList"
(there's helpful tab completion after 'method?"sum<tab>'). If you've also said
library(parallel)
options(mc.cores=detectCores()) ## how many cores to use?
then the bam files are processed in parallel (choose yieldSize so that the
different threads are not competing for memory).
because loading bams and counting overlaps is using a lot of RAM for big bams and I found online various suggestions to use the
more generally a strategy is to use yieldSize and iterate through the file, like
bf = BamFile(fl, yieldsSize=1000000)
open(bf)
repeat {
curr = readBamGappedAlignments(bf) ## read up to yieldSize records
if (length(curr) == 0)
break ## done
## process chunk 'curr'
}
close(bf)
this is actually what summarizeOverlaps does when invoked on a BamFile (or
BamFileList().
ScanBamParam to load one chr at a time...but as long as I would like to be general I decided to extract all the seqlevels in the GRanges object that I'm using and then I would like to iterate over those extracting the reads that fall on different seqlevels. I noticed that the counts where different with this approach with respect to doing all the work together. I guess that I have to find another way to split my alignments...but I'd really like to avoid something like: http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
one small thing that came out of that thread was that
as(seqinfo(BamFile("/some/file")), "GRanges")
gives a GRanges of all the sequence lengths.
Reducing my features would not work as long as I need to keep the original number of the reads...I guess I will have to "flatten" all the regions falling on a single chr of my Granges in a way that avoid overlapping reads, that should do the trick (I hope...)...maybe I could extract the smaller-bigger coords for a given chr and obtain chr-based Granges from there.
Thank you very much for your help, E.
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793