[Bioc-devel] Slow performance on scanBam

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140512/465a6eed/attachment.pl>
Hi James -- I don't think there's anything in existence to make this easier, but 
I'll expose something in the next 24 hours; is your data shareable? There might 
be deeper things to be done for processing this small-but-numerous style data.

Martin
I've been dealing with some small bam files with millions of reference
sequences leading to monster headers. As one might guess, this leads to
pretty bad performance when calling scanBam.

Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
reference sequences in the reference fasta file):

scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
    user  system elapsed
186.264   0.528 186.934

I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
code is in scanBamHeader which processes the entire header when all seqinfo
needs is the `targets` portion of the list. Additionally, the
order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
well. I've changed the body of seqinfo-BamFile from:

     h <- scanBamHeader(x)[["targets"]]
     o <- order(rankSeqlevels(names(h)))
     Seqinfo(names(h)[o], unname(h)[o])

to this:

     if (!isOpen(x)) {
         open(x)
         on.exit(close(x))
     }
     h <- .Call(.read_bamfile_header, .extptr(x))$targets
     Seqinfo(names(h), unname(h))

We then get:

scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
    user  system elapsed
  14.780   0.360  15.158

Which is still pretty slow for how small these files are, but I can
probably live with that.

Two questions:
-- do we need the: order(rankSeqlevels(names(h))) bit? that does change the
return value, but I can certainly live with the ordering in the file.
-- what else am I missing?

I can send a patch if need be, but would like to hear from the cognoscenti
first if there is a "built-in" way to avoid this overhead.

thanks, jim

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140513/6f3dac2f/attachment.pl>
Hi Martin, thanks for the quick response. The data is certainly shareable. Here
is a link to a bam + bai + sam file that I have been using for benchmarking:
https://www.dropbox.com/s/eat31mnmmco1zoh/example-bam.tar.bz2

There is a method in SAM to elide the reference names from a header, but I think
they are just shunted to another file so I gave up on that track. Since I only
end up aligning to a small fraction of the transcripts, I might be able to
post-process the file, but it would be best to process as-is.
Hi Jim -- I updated the seqinfo,BamFile-method to do more work in C, and for 
scanBamHeader to optionally parse only the targets|text part of the header. I 
also reverted a change to seqinfo,BamFile-method, introduced in Rsamtools 
version 1.15.28, to try to place seqlevels into 'natural' order; now they are 
returned in the order they appear in the file.

Together these should make for much faster code, for your sim.bam about 3.5 (vs 
185) seconds for seqinfo, and ~7s for scanBam.

This is in Rsamtools version 1.17.16, which is in svn now but won't make it to 
biocLite until tomorrow, all being well...

Martin
thanks again, jim

On Tue, May 13, 2014 at 5:16 AM, Martin Morgan <mtmorgan at fhcrc.org
<mailto:mtmorgan at fhcrc.org>> wrote:

    Hi James -- I don't think there's anything in existence to make this easier,
    but I'll expose something in the next 24 hours; is your data shareable?
    There might be deeper things to be done for processing this
    small-but-numerous style data.

    Martin

    On 05/12/2014 05:32 PM, James Bullard wrote:

        I've been dealing with some small bam files with millions of reference
        sequences leading to monster headers. As one might guess, this leads to
        pretty bad performance when calling scanBam.

        Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
        reference sequences in the reference fasta file):

            scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam")

             user  system elapsed
        186.264   0.528 186.934

        I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
        code is in scanBamHeader which processes the entire header when all seqinfo
        needs is the `targets` portion of the list. Additionally, the
        order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
        well. I've changed the body of seqinfo-BamFile from:

              h <- scanBamHeader(x)[["targets"]]
              o <- order(rankSeqlevels(names(h)))
              Seqinfo(names(h)[o], unname(h)[o])

        to this:

              if (!isOpen(x)) {
                  open(x)
                  on.exit(close(x))
              }
              h <- .Call(.read_bamfile_header, .extptr(x))$targets
              Seqinfo(names(h), unname(h))

        We then get:

            scanBam("sim.fasta-L27-ma8-__mp6-rfg5_2-rdg3_1.bam")

             user  system elapsed
           14.780   0.360  15.158

        Which is still pretty slow for how small these files are, but I can
        probably live with that.

        Two questions:
        -- do we need the: order(rankSeqlevels(names(h))) bit? that does change the
        return value, but I can certainly live with the ordering in the file.
        -- what else am I missing?

        I can send a patch if need be, but would like to hear from the cognoscenti
        first if there is a "built-in" way to avoid this overhead.

        thanks, jim

                 [[alternative HTML version deleted]]

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

    --
    Computational Biology / Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N.
    PO Box 19024 Seattle, WA 98109

    Location: Arnold Building M1 B861
    Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20140520/54f05b74/attachment.pl>