[Bioc-devel] writeVcf performance

On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next week 
when we're both back in the office.

Val

On 08/05/14 07:46, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). It only takes 3 minutes to call the genotypes
(and associated likelihoods, etc.) from the variant calls (using 80 cores and
450 GB RAM on one node), so writing the output is the bottleneck. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
    user  system elapsed
   6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
    user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)  # overwritten next; kept so the RNG state matches the run above
end <- sample(1e8, 1e7)
system.time(as.character(end))
    user  system elapsed
  57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
    user  system elapsed
  23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)
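
One way to test the heap/gc hunch is to compare the total elapsed time with the time R reports spending in the garbage collector. The following is a minimal sketch and was not run as part of this thread: `time_with_gc` is a hypothetical helper, and the sizes are kept small so that it finishes quickly. `gc.time()` is base R.

```r
# Hypothetical helper: time a function call and report how much of the
# elapsed time R attributes to garbage collection.
time_with_gc <- function(f) {
  gc()                                  # start from a freshly collected heap
  gc_before <- gc.time()                # cumulative GC time so far
  elapsed <- system.time(f(), gcFirst = FALSE)[["elapsed"]]
  gc_after <- gc.time()
  c(elapsed = elapsed,
    gc_elapsed = (gc_after - gc_before)[[3]])
}

set.seed(1000)
sizes <- c(1e4, 1e5, 1e6)               # small sizes; scale up to reproduce
res <- t(sapply(sizes, function(n) {
  end <- sample(1e8, n)
  time_with_gc(function() paste0("END", "=", end))
}))
rownames(res) <- format(sizes, scientific = TRUE)
print(res)
```

If the `gc_elapsed` column grows faster than `elapsed` as the size increases, that would point toward the collector rather than the string hash table.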

Michael

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

On 08/26/2014 09:34 AM, Gabe Becker wrote:

Val,

Has there been any movement on this? This remains a substantial bottleneck
for us when writing very large VCF files (e.g. variants+genotypes for whole
genome NGS samples).

I was able to see a ~25% speedup with 4 cores and an "optimal" speedup of
~2x with 10-12 cores for a VCF with 500k rows using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use of
serial C code.
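
A naive parallelization of this kind might look like the sketch below. It is an assumption about the general shape, not the code that produced the numbers above: `format_rows_parallel` and the toy columns are hypothetical. `parallel::mclapply` relies on forking, so the sketch falls back to one core on Windows.

```r
library(parallel)

# Hypothetical helper: format the rows of a data.frame as tab-separated
# lines, handing one contiguous block of rows to each worker.
format_rows_parallel <- function(df, n_chunks = 4L, cores = 2L) {
  if (.Platform$OS.type == "windows") cores <- 1L   # no forking on Windows
  idx <- splitIndices(nrow(df), n_chunks)           # contiguous row blocks
  pieces <- mclapply(idx, function(i) {
    do.call(paste, c(unname(as.list(df[i, , drop = FALSE])), sep = "\t"))
  }, mc.cores = cores)
  unlist(pieces, use.names = FALSE)                 # blocks come back in order
}

set.seed(1000)
toy <- data.frame(chrom = "chr1",
                  pos = sort(sample(1e8, 1e5)),
                  info = paste0("END=", sample(1e8, 1e5)),
                  stringsAsFactors = FALSE)
out_lines <- format_rows_parallel(toy)
head(out_lines, 3)
```

Each worker still builds its strings in its own R process and sends them back to the parent, which may be one reason the gains from a scheme like this level off.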

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


Computational Biologist
Genentech Research

On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org> wrote:

Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap 
initialization.

Michael, is that right or should we still consider this on the table?

Val

On 08/26/2014 10:47 AM, Michael Lawrence wrote:

My understanding is that the heap optimization provided marginal gains, and
that we need to think harder about how to optimize all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.
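
One way to reduce the string work is sketched below, under the assumption that a column contains many repeated values: convert only the distinct values and then expand by index. `as_character_dedup` is a hypothetical helper, not part of VariantAnnotation, and the toy `depth` column is invented for illustration.

```r
# Hypothetical helper: convert each distinct value once, then expand by index.
as_character_dedup <- function(x) {
  u <- unique(x)
  as.character(u)[match(x, u)]
}

set.seed(1000)
depth <- sample(0:200, 1e6, replace = TRUE)   # toy column, few distinct values

t_plain <- system.time(a <- as.character(depth))[["elapsed"]]
t_dedup <- system.time(b <- as_character_dedup(depth))[["elapsed"]]
c(plain = t_plain, dedup = t_dedup)
```

Whether this pays off depends on how repetitive the column is. It would not help for nearly unique values such as the `END` positions in the example at the top of the thread.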


On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

I didn't see a reproducible (simulated, I guess) example in the original
thread, to be explicit about what the problem is?

Martin

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com> wrote:

Gabe is still testing/profiling, but we'll send something randomized along
eventually.

Martin and Val,

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
profiling enabled. The results of summaryRprof for that run are attached,
though for a variety of reasons they are pretty misleading.
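
For readers who want to collect a comparable profile, the sketch below shows the basic `Rprof()`/`summaryRprof()` pattern on a stand-in workload. The workload is a hypothetical placeholder (repeated `paste0` calls), not writeVcf itself, and the sampling interval is an arbitrary choice.

```r
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)      # start the sampling profiler
t0 <- proc.time()[["elapsed"]]
# Stand-in workload: keep pasting for about one second so samples accumulate.
while (proc.time()[["elapsed"]] - t0 < 1) {
  x <- paste0("END", "=", sample(1e8, 1e5))
}
Rprof(NULL)                            # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.total)
unlink(prof_file)
```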

It took over an hour to write (3700+ seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on, and more to the point, that memory for an internal
representation to be written out is allocated at all. Streaming across the
object, looping by rows and writing directly to file (e.g. from C) should
be blisteringly fast in comparison.
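
The streaming idea can be illustrated in plain R, with the caveat that this is only a sketch: the proposal above is to do this from C, and `write_rows_streaming` and the toy table are hypothetical. Each block of rows is formatted and written to an open connection, so only one block of formatted strings needs to exist at a time.

```r
# Hypothetical helper: format and write a data.frame block by block, so the
# full character representation is never built in one piece.
write_rows_streaming <- function(df, path, chunk_size = 10000L) {
  con <- file(path, open = "w")
  on.exit(close(con))
  n <- nrow(df)
  if (n == 0L) return(invisible(0L))
  for (start in seq(1L, n, by = chunk_size)) {
    i <- start:min(start + chunk_size - 1L, n)
    block <- do.call(paste, c(unname(as.list(df[i, , drop = FALSE])), sep = "\t"))
    writeLines(block, con)              # append this block to the file
  }
  invisible(n)
}

set.seed(1000)
toy <- data.frame(chrom = "chr1",
                  pos = sort(sample(1e8, 25000)),
                  info = paste0("END=", sample(1e8, 25000)),
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".txt")
write_rows_streaming(toy, out)
readLines(out, n = 2)
```

The chunk size trades peak memory against the number of write calls. A C implementation could go further and skip the intermediate R strings entirely.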

~G

Computational Biologist
Genentech Research

The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, and is faster but
still on the order of dozens of minutes.

Sorry for the confusion.
~G

Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
with profiling enabled. The results of summaryRprof for that run are
attached, though for a variety of reasons they are pretty misleading.

It took over an hour to write (3700+seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on, and more to the point, that memory for an internal
representation to be written out is allocated at all.  Streaming across the
object, looping by rows and writing directly to file (e.g. from C) should
be blisteringly fast in comparison.

~G

On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com>
wrote:

Gabe is still testing/profiling, but we'll send something randomized
along eventually.

On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org>
wrote:

I didn't see in the original thread a reproducible (simulated, I guess)
example, to be explicit about what the problem is??

Martin

On 08/26/2014 10:47 AM, Michael Lawrence wrote:

My understanding is that the heap optimization provided marginal gains,
and
that we need to think harder about how to optimize the all of the string
manipulation in writeVcf. We either need to reduce it or reduce its
overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.

On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

 Hi Gabe,
Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap
initialization.

Michael, is that right or should we still consider this on the table?

Val

On 08/26/2014 09:34 AM, Gabe Becker wrote:

 Val,
Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and  an "optimal"
speedup
of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious
use
of serial C code.

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G

Try to run it through the lineprof package for memory profiling; I have
found this to be very helpful.

Here is an old blog post I wrote about it:
  http://www.hansenlab.org/rstats/2014/01/30/lineprof/

Kasper

The profiling I attached in my previous email is for 24 geno fields, as I
said, but our typical use case involves only ~4-6 fields, and is faster
but still on the order of dozens of minutes.

Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com> wrote:

Martin and Val.

I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
with profiling enabled. The results of summaryRprof for that run are
attached, though for a variety of reasons they are pretty misleading.
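
For anyone who wants to collect a comparable profile, a minimal sketch
using base R's Rprof() and summaryRprof() follows. The workload is only a
stand-in (the paste0 example from earlier in this thread), not writeVcf
itself, and the helper names paste_workload and profile_call are
hypothetical.

```r
# Stand-in workload (an assumption): the kind of string pasting discussed
# in this thread, not the actual writeVcf code.
paste_workload <- function(n) {
    end <- sample(1e8, n)
    paste0("END", "=", end)
}

# Run a zero-argument function with the sampling profiler switched on and
# return both its result and the path of the profile file.
profile_call <- function(fun, interval = 0.01) {
    prof_file <- tempfile(fileext = ".out")
    Rprof(prof_file, interval = interval)
    on.exit(Rprof(NULL), add = TRUE)   # always stop profiling on exit
    result <- fun()
    list(result = result, profile = prof_file)
}

out <- profile_call(function() paste_workload(1e5))

# summaryRprof() stops with an error if no samples were recorded (e.g. the
# call finished within one sampling interval), so guard against that.
top <- tryCatch(head(summaryRprof(out$profile)$by.self),
                error = function(e) NULL)
```

The by.self table attributes time to the function that was executing when
each sample was taken, which is usually the first place to look.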

It took over an hour to write (3700+ seconds), so it's definitely a
bottleneck when the data get very large, even if it isn't for smaller
data.

Michael and I both think the culprit is all the pasting and cbinding that
is going on, and more to the point, that memory for an internal
representation to be written out is allocated at all. Streaming across
the object, looping by rows and writing directly to file (e.g. from C)
should be blisteringly fast in comparison.
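
To make the streaming idea concrete, here is a rough R-level sketch that
formats and writes a fixed number of rows at a time to an open
connection, so the full character representation never exists in memory
at once. It is only an approximation of the idea: the proposal above is
to do this from C, which would also avoid allocating R strings per chunk.
The function name write_rows_chunked is hypothetical and a plain
data.frame stands in for a VCF object.

```r
# Write a data.frame as tab-separated lines in fixed-size row chunks, so
# only one chunk of formatted strings is alive at any time.
write_rows_chunked <- function(df, path, chunk_size = 1e5) {
    con <- file(path, open = "wt")
    on.exit(close(con), add = TRUE)
    n <- nrow(df)
    if (n == 0L) return(invisible(path))   # nothing to write
    for (start in seq(1L, n, by = chunk_size)) {
        rows <- start:min(start + chunk_size - 1L, n)
        chunk <- df[rows, , drop = FALSE]
        # One tab-separated line per row of the current chunk
        lines <- do.call(paste, c(unname(as.list(chunk)), sep = "\t"))
        writeLines(lines, con)
    }
    invisible(path)
}
```

Peak memory for the formatted text is then bounded by chunk_size rather
than by the number of rows, at the cost of one paste call per chunk.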

~G
