[Bioc-devel] writeVcf performance

9 messages · Valerie Obenchain, Michael Lawrence, Martin Morgan +2 more

#
Hi Michael,

I'm interested in working on this. I'll discuss with Martin next week 
when we're both back in the office.

Val
On 08/05/14 07:46, Michael Lawrence wrote:
20 days later
#
Val,

Has there been any movement on this? This remains a substantial bottleneck
for us when writing very large VCF files (e.g. variants+genotypes for whole
genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of
~2x with 10-12 cores for a VCF with 500k rows, using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use of
serial C code.
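
A minimal sketch of what a naive chunk-wise parallelization of this kind can look like, written in Python purely for illustration. It is not the strategy actually benchmarked above, and format_rows, chunked and format_parallel are hypothetical names. The rows are split into consecutive chunks, each chunk is formatted to text independently, and the pieces are concatenated in their original order.

```python
from concurrent.futures import ThreadPoolExecutor


def format_rows(rows):
    """Format a chunk of records as tab-delimited text lines."""
    return "".join("\t".join(map(str, row)) + "\n" for row in rows)


def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def format_parallel(rows, chunk_size=2, workers=4):
    """Format chunks concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "".join(pool.map(format_rows, chunked(rows, chunk_size)))
```

The thread pool here only shows the structure. In CPython, threads do not speed up CPU-bound string formatting, so a real speedup would need separate worker processes, at the cost of copying each chunk to its worker.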

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
    
#
Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap 
initialization.

Michael, is that right or should we still consider this on the table?


Val
On 08/26/2014 09:34 AM, Gabe Becker wrote:
#
My understanding is that the heap optimization provided marginal gains, and
that we need to think harder about how to optimize all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.


On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
  
#
I didn't see a reproducible (simulated, I guess) example in the original
thread that makes explicit what the problem is. Is there one?

Martin
On 08/26/2014 10:47 AM, Michael Lawrence wrote:

  
    
#
Gabe is still testing/profiling, but we'll send something randomized along
eventually.
On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

            

  
  
#
Martin and Val,

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+ seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on and, more to the point, that memory is allocated at all for an
internal representation of what is to be written out. Streaming across the
object, looping by rows and writing directly to file (e.g. from C) should
be blisteringly fast in comparison.
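
The contrast can be sketched as follows, in Python purely for illustration. This is not writeVcf's code, and write_buffered and write_streaming are hypothetical names. The first function builds every output line in memory before writing anything. The second formats and writes one record at a time, so no intermediate of the full output size is ever allocated.

```python
import io


def write_buffered(records, out):
    """Build all output lines in memory first, then write them in one call."""
    lines = ["\t".join(map(str, rec)) + "\n" for rec in records]  # full copy
    out.write("".join(lines))  # second full-size string


def write_streaming(records, out):
    """Format and write one record at a time; no full-size intermediate."""
    for rec in records:
        out.write("\t".join(map(str, rec)))
        out.write("\n")


# Both produce identical output; only peak memory use differs.
records = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")]
buffered, streamed = io.StringIO(), io.StringIO()
write_buffered(records, buffered)
write_streaming(records, streamed)
```

Because write_streaming only iterates, records could equally be a generator that yields rows on demand, which is the property that matters for very large inputs.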

~G


On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
wrote:

  
    
#
The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, which is faster
but still takes on the order of dozens of minutes.

Sorry for the confusion.
~G
On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:

            

  
    
1 day later
#
Try running it through the lineprof package for memory profiling; I have
found this to be very helpful.

Here is an old blog post I wrote about it
  http://www.hansenlab.org/rstats/2014/01/30/lineprof/

Kasper
On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker <becker.gabe at gene.com> wrote: