[Bioc-devel] writeVcf performance
Try running it through the lineprof package for memory profiling; I have found this to be very helpful. Here is an old blog post I wrote about it: http://www.hansenlab.org/rstats/2014/01/30/lineprof/
Kasper
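A minimal sketch of what that could look like for writeVcf, assuming lineprof is installed from GitHub (hadley/lineprof) and a VCF object `vcf` is already in hand; the output path is just a placeholder:

# lineprof is not on CRAN; it is installed from GitHub:
# devtools::install_github("hadley/lineprof")
library(lineprof)
library(VariantAnnotation)

# Profile a single writeVcf call; 'vcf' is a placeholder VCF object.
prof <- lineprof(writeVcf(vcf, tempfile(fileext = ".vcf")))
prof         # time and memory allocated/released, broken down by call
shine(prof)  # interactive drill-down in the browser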
On Wed, Aug 27, 2014 at 2:56 PM, Gabe Becker <becker.gabe at gene.com> wrote:
The profiling I attached in my previous email is for 24 geno fields, as I said, but our typical use case involves only ~4-6 fields and is faster, though still on the order of dozens of minutes. Sorry for the confusion. ~G
On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:
Martin and Val,
I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with profiling enabled. The results of summaryRprof for that run are attached, though for a variety of reasons they are pretty misleading. It took over an hour to write (3700+ seconds), so it's definitely a bottleneck when the data get very large, even if it isn't for smaller data.
Michael and I both think the culprit is all the pasting and cbinding going on and, more to the point, the fact that memory for an internal representation of the output is allocated at all. Streaming across the object, looping by rows and writing directly to the file (e.g. from C), should be blisteringly fast in comparison. ~G
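One rough shape that streaming strategy could take, sketched here at the R level even though the thread proposes doing it in C; `formatVcfLines` is a hypothetical per-chunk formatter, not VariantAnnotation API:

writeVcfStreaming <- function(vcf, path, chunk_size = 1e6L) {
    # Write the body in fixed-size row chunks so peak memory is bounded
    # by one chunk rather than a full in-memory copy of the output.
    con <- file(path, open = "wt")
    on.exit(close(con))
    n <- nrow(vcf)
    for (start in seq(1L, n, by = chunk_size)) {
        idx <- start:min(start + chunk_size - 1L, n)
        # formatVcfLines (hypothetical) turns a slice into character lines
        writeLines(formatVcfLines(vcf[idx, ]), con)
    }
}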
On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com> wrote:
Gabe is still testing/profiling, but we'll send something randomized along eventually.
On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
I didn't see a reproducible (simulated, I guess) example in the original thread; can you be explicit about what the problem is? Martin
On 08/26/2014 10:47 AM, Michael Lawrence wrote:
My understanding is that the heap optimization provided marginal gains, and that we need to think harder about how to optimize all of the string manipulation in writeVcf. We either need to reduce it or reduce its overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
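The CHARSXP cost is easy to see in isolation: every distinct string allocates a new entry in R's global string cache, whereas repeating one shared string does not. A small illustration (not code from the thread):

x <- sample(1e8, 1e7)
system.time(paste0("END=", x))    # ~10 million distinct strings: slow
system.time(rep("END=1", 1e7))    # one shared CHARSXP, repeated: near-instant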
On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
Hi Gabe,
Martin responded, and so did Michael: https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html It sounded like Michael was OK with working with/around the heap initialization. Michael, is that right, or should we still consider this on the table? Val
On 08/26/2014 09:34 AM, Gabe Becker wrote:
Val,
Has there been any movement on this? This remains a substantial bottleneck for us when writing very large VCF files (e.g. variants + genotypes for whole-genome NGS samples). I was able to see a ~25% speedup with 4 cores, and an "optimal" speedup of ~2x with 10-12 cores, for a VCF with 500k rows using a very naive parallelization strategy and no other changes. I suspect this could be improved on quite a bit, or possibly made irrelevant with judicious use of serial C code. Did you and Martin make any plans regarding optimizing writeVcf? Best, ~G
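The naive strategy Gabe mentions presumably looks something like the following: format contiguous row chunks in parallel, then do a single ordered, serial write. This is a guess at the shape, reusing the hypothetical `formatVcfLines` from above:

library(parallel)

writeVcfParallel <- function(vcf, path, cores = 4L) {
    # splitIndices() divides 1:nrow(vcf) into 'cores' contiguous pieces
    idx <- splitIndices(nrow(vcf), cores)
    chunks <- mclapply(idx, function(i) formatVcfLines(vcf[i, ]),
                       mc.cores = cores)
    # a single serial write preserves row order
    writeLines(unlist(chunks, use.names = FALSE), path)
}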
On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
Hi Michael,
I'm interested in working on this. I'll discuss with Martin next week when we're both back in the office.
Val
On 08/05/14 07:46, Michael Lawrence wrote:
Hi guys (Val, Martin, Herve):
Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50-million-row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows), which brought it down to about 1 hour; that is still too long by my standards (> 0). It takes only 3 minutes to call the genotypes (and associated likelihoods etc.) from the variant calls (using 80 cores and 450 GB RAM on one node), so the output is an issue. Profiling suggests that the running time scales non-linearly in the number of rows.
Digging a little deeper, it seems to be something with R's string/memory allocation. Below, pasting 1 million strings takes 6 seconds, but 10 million strings takes over 2 minutes. It gets way worse with 50 million. I suspect it has something to do with R's string hash table.
set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
   user  system elapsed
  6.396   0.028   6.420
end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
   user  system elapsed
134.714   0.352 134.978
Indeed, even this takes a long time (in a fresh session):
set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
   user  system elapsed
 57.224   0.156  57.366
But running it a second time is faster (about what one would expect?):
system.time(levels <- as.character(end))
   user  system elapsed
 23.582   0.021  23.589
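To see the super-linear growth directly, a small sweep along these lines (not from the original thread; exact timings will vary by machine and R version) can be run in a fresh session:

# Time paste0() over increasing vector sizes to check whether elapsed
# time grows faster than linearly in the number of strings created.
set.seed(1000)
for (n in c(1e5, 1e6, 1e7)) {
    x <- sample(1e8, n)
    t <- system.time(paste0("END", "=", x))[["elapsed"]]
    cat(sprintf("n = %.0e: %.2f s elapsed\n", n, t))
}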
I did some simple profiling of R to find that the resizing of the string hash table is not a significant component of the time. So maybe something to do with the R heap/gc? No time right now to go deeper. But I know Martin likes this sort of thing ;)
Michael
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel