Skip to content

[Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

8 messages · Michael Love, Valerie Obenchain, Hervé Pagès

#
Hi Mike,

You no longer need to sort Bam files to use the pairing algo or 
yieldSize. The readGAlignment* functions now work with both constraints 
out of the box.

Create a BamFile with yieldSize and indicate you want mates.
bf <- BamFile(fl, yieldSize=10000, asMates=TRUE)

Maybe set some specifications in a param:
param <- ScanBamParam(what = c("qname", "flag"))

Then call either readGAlignment* method that handles pairs:
readGAlignmentsList(bf, param=param)
readGAlignmentPairs(bf, param=param)

For summarizeOverlaps():
summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)

We've considered removing the 'obeyQname' arg and documentation but 
thought the concept may be useful in another application. I'll revisit 
the summarizeOverlaps() documentation to make sure 'obeyQname' is 
downplayed and 'asMates' is encouraged.

Valerie
On 03/19/14 07:39, Michael Love wrote:
#
On 03/19/14 10:24, Michael Love wrote:
Sorry, our emails keep criss-crossing.

Because the mate-pairing is now done in C yieldSize is no longer a 
constraint.

When yieldSize is given, say 100, then 100 mates are returned. The algo 
moves through the file until 100 records are successfully paired. These 
100 are yielded to the user and the code picks up where it left off. A 
related situation is the 'which' in a param. In this case you want the 
mates in a particular range. The algo moves through the range and pairs 
what it can. If it's left with unmated records it goes outside the range 
looking for them. readGAlignmentsList() will return all records in this 
range, mated or not. The metadata column of 'mate_status' indicates the 
different groupings.

Valerie
2 days later
5 days later
#
Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting 
the 'yieldSize' requirement. Thanks for sending the file and 
reproducible example.

The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
To process the complete file with a yield size of 1e6 took ~ 18 GIG and 
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)

bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
     summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)
Thanks for reporting the bug.

Valerie
On 03/21/14 13:55, Michael Love wrote:
#
I should also mention that when both 'yieldSize' in the BamFile and 
'which' in ScanBamParam are set the 'which' gets priority. The point of 
'yieldSize' is to provide a reasonable chunk for processing the file. 
When 'which' is provided it's assumed that range is of reasonable chunk 
size so the yield is ignored.

I've added this info to the 'summarizeOverlaps' and 'readGAlignments' 
man pages in GenomicAlignments.

Valerie
On 03/27/14 08:30, Valerie Obenchain wrote:
#
Hi Val,
On 03/27/2014 09:13 AM, Valerie Obenchain wrote:
Note that more than 1 range can be specified thru 'which'.

What about emitting a warning when 'yieldSize' is ignored?
Is this a property of scanBam()? If so then maybe it should be
documented in the man page for scanBam(). I just had a quick look
and was not able to find it, sorry if I missed it.

summarizeOverlaps(), readGAlignments(), and probably other tools,
just rely on scanBam().

Thanks,
H.

  
    
1 day later
#
Hi Herve,

I must retract my previous statement about 'yieldSize' and 'which'. As 
of Rsamtools 1.15.0, scanBam() (and functions that build on it) does 
handle the case where both are supplied. This is true for the non-mate 
and mate-pairing code.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl, yieldSize=1000)
which <- tileGenome(seqlengths(bf),
      tilewidth=500, cut.last.tile.in.chrom=TRUE)
param <- ScanBamParam(which=which, what="qname")
FUN <- function(elt) length(elt[[1]])

Here we have both 'yieldSize' and a 'which' in the param. We ask for a 
yield of 1000 records. The first range only has 394, the second has 570 
and from the third we get 570. As explained in the man page snippit 
above, records are yielded in complete ranges whose sum first exceeds 
'yieldSize'. In range 3 we exceed the 1000 so we get all of range 3 then 
stop.

sapply(scanBam(bf, param=param), FUN)
We can open the file and yield through all records:

bf <- open(BamFile(fl, yieldSize=1000))
sapply(scanBam(bf, param=param), FUN)
I've removed the misinformation from the man pages I altered. Also added 
a unit test for the mates code with 'yieldSize' and 'which' in Rsamtools.

Val
On 03/27/2014 11:36 AM, Herv? Pag?s wrote: