Ok, thanks - I misread the manual, I just thought that was a way to filter out reads falling over certain regions. After reading the alignments I use summarizeOverlaps to obtain counts with the desired options. I decided to use scanBamParam(which=..) because loading bams and counting overlaps is using a lot of RAM for big bams and I found online various suggestions to use the ScanBamParam to load one chr at a time...but as long as I would like to be general I decided to extract all the seqlevels in the GRanges object that I'm using and then I would like to iterate over those extracting the reads that fall on different seqlevels. I noticed that the counts where different with this approach with respect to doing all the work together. I guess that I have to find another way to split my alignments...but I'd really like to avoid something like: http://comments.gmane.org/gmane.comp.lang.r.sequencing/755 Reducing my features would not work as long as I need to keep the original number of the reads...I guess I will have to "flatten" all the regions falling on a single chr of my Granges in a way that avoid overlapping reads, that should do the trick (I hope...)...maybe I could extract the smaller-bigger coords for a given chr and obtain chr-based Granges from there. Thank you very much for your help, E.
[Bioc-devel] [BioC] readGappedAlignments and param argument
4 messages · Elena Grassi, Martin Morgan
On 09/10/2013 12:22 AM, Elena Grassi wrote:
Ok, thanks - I misread the manual, I just thought that was a way to filter out reads falling over certain regions. After reading the alignments I use summarizeOverlaps to obtain counts with the desired options. I decided to use scanBamParam(which=..)
You can use summarizeOverlaps with a 'BamFileList' created by something like
myFiles = dir("/some/dir", pattern="bam$")
bfl = BamFileList(myFiles, yieldSize=1000000)
olaps = summarizeOverlaps(features, bfl)
see the example on the help page
method?"summarizeOverlaps,GRanges,BamFileList"
(there's helpful tab completion after 'method?"sum<tab>'). If you've also said
library(parallel)
options(mc.cores=detectCores()) ## how many cores to use?
then the bam files are processed in parallel (choose yieldSize so that the
different threads are not competing for memory).
because loading bams and counting overlaps is using a lot of RAM for big bams and I found online various suggestions to use the
more generally a strategy is to use yieldSize and iterate through the file, like
bf = BamFile(fl, yieldsSize=1000000)
open(bf)
repeat {
curr = readBamGappedAlignments(bf) ## read up to yieldSize records
if (length(curr) == 0)
break ## done
## process chunk 'curr'
}
close(bf)
this is actually what summarizeOverlaps does when invoked on a BamFile (or
BamFileList().
ScanBamParam to load one chr at a time...but as long as I would like to be general I decided to extract all the seqlevels in the GRanges object that I'm using and then I would like to iterate over those extracting the reads that fall on different seqlevels. I noticed that the counts where different with this approach with respect to doing all the work together. I guess that I have to find another way to split my alignments...but I'd really like to avoid something like: http://comments.gmane.org/gmane.comp.lang.r.sequencing/755
one small thing that came out of that thread was that
as(seqinfo(BamFile("/some/file")), "GRanges")
gives a GRanges of all the sequence lengths.
Reducing my features would not work as long as I need to keep the original number of the reads...I guess I will have to "flatten" all the regions falling on a single chr of my Granges in a way that avoid overlapping reads, that should do the trick (I hope...)...maybe I could extract the smaller-bigger coords for a given chr and obtain chr-based Granges from there.
Thank you very much for your help, E.
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
You can use summarizeOverlaps with a 'BamFileList' created by something like
myFiles = dir("/some/dir", pattern="bam$")
bfl = BamFileList(myFiles, yieldSize=1000000)
olaps = summarizeOverlaps(features, bfl)
see the example on the help page
method?"summarizeOverlaps,GRanges,BamFileList"
I see, thanks. Right now I was looking for a solution "prior" to summarizeOverlaps due to the structure that I've given to my package - but I will re-think about it and check if in this way it works (I only have 20 unittest or so to rewrite, no worries for me :) ). Right now I've found a way to define the single chromosome GRanges that works with "which=..." and I would like to offer flexibility for people with small RAM...therefore running the whole analysis on one chr at a time seems a reasonable option all the same.
one small thing that came out of that thread was that
as(seqinfo(BamFile("/some/file")), "GRanges")
gives a GRanges of all the sequence lengths.
I saw that but I wanted to avoid having to do that for every bam and merge the results afterwards - working at the "annotation" level seemed more sensible to me. I guess that I should have asked my boss to have 1 month to scavenge around mans and vignettes before starting to write my first bioconductor-R package :) E.
On 09/10/2013 03:40 AM, Elena Grassi wrote:
You can use summarizeOverlaps with a 'BamFileList' created by something like
myFiles = dir("/some/dir", pattern="bam$")
bfl = BamFileList(myFiles, yieldSize=1000000)
olaps = summarizeOverlaps(features, bfl)
see the example on the help page
method?"summarizeOverlaps,GRanges,BamFileList"
I see, thanks. Right now I was looking for a solution "prior" to summarizeOverlaps due to the structure that I've given to my package - but I will re-think about it and check if in this way it works (I only have 20 unittest or so to rewrite, no worries for me :) ). Right now I've found a way to define the single chromosome GRanges that works with "which=..." and I would like to offer flexibility for people with small RAM...therefore running the whole analysis on one chr at a time seems a reasonable option all the same.
personally, I think it's better to iterate through the file using yieldSize
rather than 'by chromosome' (it would handle very large files, for instance,
where even a single chromosome has too many alignments to fit in memory).
The following function would be my second choice (not well-tested; hopefully a
version will appear in the next GenomicRanges), which divides seqlengths into a
GRangesList where each element contains ranges that add to approximately equal
sized widths. So in a big bam file you could just make more tiles.
Maybe the next stumbling block is collating results across tiles? The basic
strategy will be 'pre-allocate and fill'...
Martin
library(GenomicRanges)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)
yy <- seqlengths(BamFile(fl))
yy <- yy[grep("_", names(yy), invert=TRUE)]
tile(yy, 40)
tile <-
function(seqlengths, n)
{
if (is.null(names(seqlengths)))
stop("names(seqlengths) must not be NULL")
clen <- cumsum(as.numeric(seqlengths))
breaks0 <- ceiling(seq(1L, clen[length(clen)], length.out=n + 1L))
breaks <- sort(unique(c(clen, breaks0)))[-1]
grp <- cumsum(c(TRUE, breaks[-length(breaks)] %in% breaks0))
idx <- cumsum(c(TRUE, (breaks %in% clen)[-length(breaks)]))
splt <- split(breaks, names(seqlengths)[idx])
ends <- Map(`-`, splt, c(0, clen[-length(clen)]))
starts <- lapply(ends, function(x) c(0L, x[-length(x)]) + 1L)
grg <- GRanges(rep(names(ends), sapply(ends, length)),
IRanges(unlist(starts, use.names=FALSE),
unlist(ends, use.names=FALSE)),
seqlengths=seqlengths)
reduce(splitAsList(grg, grp))
}
one small thing that came out of that thread was that
as(seqinfo(BamFile("/some/file")), "GRanges")
gives a GRanges of all the sequence lengths.
I saw that but I wanted to avoid having to do that for every bam and merge the results afterwards - working at the "annotation" level seemed more sensible to me. I guess that I should have asked my boss to have 1 month to scavenge around mans and vignettes before starting to write my first bioconductor-R package :) E.
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793