[Bioc-devel] writeVcf performance

The profiling I attached in my previous email is for 24 geno fields, as I said,
but our typical usecase involves only ~4-6 fields, and is faster but still on
the order of dozens of minutes.
I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage 
collector runs multiple times as the different strings are pasted together, and 
has to traverse, in linear time, increasing numbers of allocated SEXPs. So times 
scale approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by writing out in chunks 
-- as each chunk is written, the SEXPs become available for collection and are 
re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
     idx <- splitIndices(length(x), 20)
     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
     system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the 
relevant time is

    user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

    user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character data -- 
processing in independent chunks of moderate (1M?) size (enabling as a 
consequence parallel evaluation in modest memory) that are sufficient to benefit 
from vectorization, but that do not entail allocation of large numbers of in-use 
SEXPs.

Martin
Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com
<mailto:beckerg4 at gene.com>> wrote:

    Martin and Val.

    I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields) with
    profiling enabled. The results of summaryRprof for that run are attached,
    though for a variety of reasons they are pretty misleading.

    It took over an hour to write (3700+seconds), so it's definitely a
    bottleneck when the data get very large, even if it isn't for smaller data.

    Michael and I both think the culprit is all the pasting and cbinding that is
    going on, and more to the point, that memory for an internal representation
    to be written out is allocated at all.  Streaming across the object, looping
    by rows and writing directly to file (e.g. from C) should be blisteringly
    fast in comparison.

    ~G

    On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com
    <mailto:michafla at gene.com>> wrote:

        Gabe is still testing/profiling, but we'll send something randomized
        along eventually.

        On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>> wrote:

            I didn't see in the original thread a reproducible (simulated, I
            guess) example, to be explicit about what the problem is??

            Martin

            On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                My understanding is that the heap optimization provided marginal
                gains, and
                that we need to think harder about how to optimize the all of
                the string
                manipulation in writeVcf. We either need to reduce it or reduce its
                overhead (i.e., the CHARSXP allocation). Gabe is doing more tests.

                On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
                <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                wrote:

                    Hi Gabe,

                    Martin responded, and so did Michael,

                    https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
                    <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>

                    It sounded like Michael was ok with working with/around heap
                    initialization.

                    Michael, is that right or should we still consider this on
                    the table?

                    Val

                    On 08/26/2014 09:34 AM, Gabe Becker wrote:

                        Val,

                        Has there been any movement on this? This remains a
                        substantial
                        bottleneck for us when writing very large VCF files (e.g.
                        variants+genotypes for whole genome NGS samples).

                        I was able to see a ~25% speedup with 4 cores and  an
                        "optimal" speedup
                        of ~2x with 10-12 cores for a VCF with 500k rows  using
                        a very naive
                        parallelization strategy and no other changes. I suspect
                        this could be
                        improved on quite a bit, or possibly made irrelevant
                        with judicious use
                        of serial C code.

                        Did you and Martin make any plans regarding optimizing
                        writeVcf?

                        Best
                        ~G

                        On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
                        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                        <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>>
                        wrote:

                              Hi Michael,

                              I'm interested in working on this. I'll discuss
                        with Martin next
                              week when we're both back in the office.

                              Val

                              On 08/05/14 07:46, Michael Lawrence wrote:

                                  Hi guys (Val, Martin, Herve):

                                  Anyone have an itch for optimization? The
                        writeVcf function is
                                  currently a
                                  bottleneck in our WGS genotyping pipeline. For
                        a typical 50
                                  million row
                                  gVCF, it was taking 2.25 hours prior to
                        yesterday's improvements
                                  (pasteCollapseRows) that brought it down to
                        about 1 hour, which
                                  is still
                                  too long by my standards (> 0). Only takes 3
                        minutes to call the
                                  genotypes
                                  (and associated likelihoods etc) from the
                        variant calls (using
                                  80 cores and
                                  450 GB RAM on one node), so the output is an
                        issue. Profiling
                                  suggests that
                                  the running time scales non-linearly in the
                        number of rows.

                                  Digging a little deeper, it seems to be
                        something with R's
                                  string/memory
                                  allocation. Below, pasting 1 million strings
                        takes 6 seconds, but
                        10
                                  million strings takes over 2 minutes. It gets
                        way worse with 50
                                  million. I
                                  suspect it has something to do with R's string
                        hash table.

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                      6.396   0.028   6.420

                                  end <- sample(1e8, 1e7)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                  134.714   0.352 134.978

                                  Indeed, even this takes a long time (in a
                        fresh session):

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  end <- sample(1e8, 1e7)
                                  system.time(as.character(end))
                                       user  system elapsed
                                     57.224   0.156  57.366

                                  But running it a second time is faster (about
                        what one would
                                  expect?):

                                  system.time(levels <- as.character(end))
                                       user  system elapsed
                                     23.582   0.021  23.589

                                  I did some simple profiling of R to find that
                        the resizing of
                                  the string
                                  hash table is not a significant component of
                        the time. So maybe
                                  something
                                  to do with the R heap/gc? No time right now to
                        go deeper. But I
                                  know Martin
                                  likes this sort of thing ;)

                                  Michael

                                           [[alternative HTML version deleted]]

                          ___________________________________________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>>
                                  mailing list
                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                              ___________________________________________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>> mailing
                        list
                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                        --
                        Computational Biologist
                        Genentech Research

                         [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

            --
            Computational Biology / Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

            Location: Arnold Building M1 B861
            Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

    --
    Computational Biologist
    Genentech Research

--
Computational Biologist
Genentech Research

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.

Michael

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields, as I
said,
but our typical usecase involves only ~4-6 fields, and is faster but
still on
the order of dozens of minutes.

I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage
collector runs multiple times as the different strings are pasted together,
and has to traverse, in linear time, increasing numbers of allocated SEXPs.
So times scale approximately quadratically with the number of rows in the
VCF

An efficiency is to reduce the number of SEXPs in play by writing out in
chunks -- as each chunk is written, the SEXPs become available for
collection and are re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
        list()
    else if (ncl == 1L || nx == 1L)
        list(i)
    else {
        fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
    idx <- splitIndices(length(x), 20)
    system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
    system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
relevant time is

   user  system elapsed
 15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character
data -- processing in independent chunks of moderate (1M?) size (enabling
as a consequence parallel evaluation in modest memory) that are sufficient
to benefit from vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin

Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com
<mailto:beckerg4 at gene.com>> wrote:

    Martin and Val.

    I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
fields) with
    profiling enabled. The results of summaryRprof for that run are
attached,
    though for a variety of reasons they are pretty misleading.

    It took over an hour to write (3700+seconds), so it's definitely a
    bottleneck when the data get very large, even if it isn't for smaller
data.

    Michael and I both think the culprit is all the pasting and cbinding
that is
    going on, and more to the point, that memory for an internal
representation
    to be written out is allocated at all.  Streaming across the object,
looping
    by rows and writing directly to file (e.g. from C) should be
blisteringly
    fast in comparison.

    ~G

    On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com
    <mailto:michafla at gene.com>> wrote:

        Gabe is still testing/profiling, but we'll send something
randomized
        along eventually.

        On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <
mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>> wrote:

            I didn't see in the original thread a reproducible
(simulated, I
            guess) example, to be explicit about what the problem is??

            Martin

            On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                My understanding is that the heap optimization provided
marginal
                gains, and
                that we need to think harder about how to optimize the
all of
                the string
                manipulation in writeVcf. We either need to reduce it or
reduce its
                overhead (i.e., the CHARSXP allocation). Gabe is doing
more tests.

                On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
                <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>

                wrote:

                    Hi Gabe,

                    Martin responded, and so did Michael,

                    https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html

                    <https://stat.ethz.ch/pipermail/bioc-devel/2014-
August/006082.html>

                    It sounded like Michael was ok with working
with/around heap
                    initialization.

                    Michael, is that right or should we still consider
this on
                    the table?

                    Val

                    On 08/26/2014 09:34 AM, Gabe Becker wrote:

                        Val,

                        Has there been any movement on this? This remains
a
                        substantial
                        bottleneck for us when writing very large VCF
files (e.g.
                        variants+genotypes for whole genome NGS samples).

                        I was able to see a ~25% speedup with 4 cores
and  an
                        "optimal" speedup
                        of ~2x with 10-12 cores for a VCF with 500k rows
using
                        a very naive
                        parallelization strategy and no other changes. I
suspect
                        this could be
                        improved on quite a bit, or possibly made
irrelevant
                        with judicious use
                        of serial C code.

                        Did you and Martin make any plans regarding
optimizing
                        writeVcf?

                        Best
                        ~G

                        On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
                        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                        <mailto:vobencha at fhcrc.org <mailto:
vobencha at fhcrc.org>>>

                        wrote:

                              Hi Michael,

                              I'm interested in working on this. I'll
discuss
                        with Martin next
                              week when we're both back in the office.

                              Val

                              On 08/05/14 07:46, Michael Lawrence wrote:

                                  Hi guys (Val, Martin, Herve):

                                  Anyone have an itch for optimization?
The
                        writeVcf function is
                                  currently a
                                  bottleneck in our WGS genotyping
pipeline. For
                        a typical 50
                                  million row
                                  gVCF, it was taking 2.25 hours prior to
                        yesterday's improvements
                                  (pasteCollapseRows) that brought it
down to
                        about 1 hour, which
                                  is still
                                  too long by my standards (> 0). Only
takes 3
                        minutes to call the
                                  genotypes
                                  (and associated likelihoods etc) from
the
                        variant calls (using
                                  80 cores and
                                  450 GB RAM on one node), so the output
is an
                        issue. Profiling
                                  suggests that
                                  the running time scales non-linearly in
the
                        number of rows.

                                  Digging a little deeper, it seems to be
                        something with R's
                                  string/memory
                                  allocation. Below, pasting 1 million
strings
                        takes 6 seconds, but
                        10
                                  million strings takes over 2 minutes.
It gets
                        way worse with 50
                                  million. I
                                  suspect it has something to do with R's
string
                        hash table.

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                      6.396   0.028   6.420

                                  end <- sample(1e8, 1e7)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                  134.714   0.352 134.978

                                  Indeed, even this takes a long time (in
a
                        fresh session):

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  end <- sample(1e8, 1e7)
                                  system.time(as.character(end))
                                       user  system elapsed
                                     57.224   0.156  57.366

                                  But running it a second time is faster
(about
                        what one would
                                  expect?):

                                  system.time(levels <- as.character(end))
                                       user  system elapsed
                                     23.582   0.021  23.589

                                  I did some simple profiling of R to
find that
                        the resizing of
                                  the string
                                  hash table is not a significant
component of
                        the time. So maybe
                                  something
                                  to do with the R heap/gc? No time right
now to
                        go deeper. But I
                                  know Martin
                                  likes this sort of thing ;)

                                  Michael

                                           [[alternative HTML version
deleted]]

                          ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>>
                                  mailing list

                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel

                              ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>> mailing
                        list
                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel

                        --
                        Computational Biologist
                        Genentech Research

                         [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org

                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

            --
            Computational Biology / Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

            Location: Arnold Building M1 B861
            Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

    --
    Computational Biologist
    Genentech Research

--
Computational Biologist
Genentech Research

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into R)
stuff.

ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
about ~4.2 hrs to write out, and is about 1.5x the size of the files we are
actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.

Michael

On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields, as
I said,
but our typical usecase involves only ~4-6 fields, and is faster but
still on
the order of dozens of minutes.

I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage
collector runs multiple times as the different strings are pasted together,
and has to traverse, in linear time, increasing numbers of allocated SEXPs.
So times scale approximately quadratically with the number of rows in the
VCF

An efficiency is to reduce the number of SEXPs in play by writing out in
chunks -- as each chunk is written, the SEXPs become available for
collection and are re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
        list()
    else if (ncl == 1L || nx == 1L)
        list(i)
    else {
        fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
    idx <- splitIndices(length(x), 20)
    system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
    system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
relevant time is

   user  system elapsed
 15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character
data -- processing in independent chunks of moderate (1M?) size (enabling
as a consequence parallel evaluation in modest memory) that are sufficient
to benefit from vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin

Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com
<mailto:beckerg4 at gene.com>> wrote:

    Martin and Val.

    I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
fields) with
    profiling enabled. The results of summaryRprof for that run are
attached,
    though for a variety of reasons they are pretty misleading.

    It took over an hour to write (3700+seconds), so it's definitely a
    bottleneck when the data get very large, even if it isn't for
smaller data.

    Michael and I both think the culprit is all the pasting and cbinding
that is
    going on, and more to the point, that memory for an internal
representation
    to be written out is allocated at all.  Streaming across the object,
looping
    by rows and writing directly to file (e.g. from C) should be
blisteringly
    fast in comparison.

    ~G

    On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <
michafla at gene.com
    <mailto:michafla at gene.com>> wrote:

        Gabe is still testing/profiling, but we'll send something
randomized
        along eventually.

        On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <
mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>> wrote:

            I didn't see in the original thread a reproducible
(simulated, I
            guess) example, to be explicit about what the problem is??

            Martin

            On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                My understanding is that the heap optimization provided
marginal
                gains, and
                that we need to think harder about how to optimize the
all of
                the string
                manipulation in writeVcf. We either need to reduce it or
reduce its
                overhead (i.e., the CHARSXP allocation). Gabe is doing
more tests.

                On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
                <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>

                wrote:

                    Hi Gabe,

                    Martin responded, and so did Michael,

                    https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html

                    <https://stat.ethz.ch/pipermail/bioc-devel/2014-
August/006082.html>

                    It sounded like Michael was ok with working
with/around heap
                    initialization.

                    Michael, is that right or should we still consider
this on
                    the table?

                    Val

                    On 08/26/2014 09:34 AM, Gabe Becker wrote:

                        Val,

                        Has there been any movement on this? This
remains a
                        substantial
                        bottleneck for us when writing very large VCF
files (e.g.
                        variants+genotypes for whole genome NGS samples).

                        I was able to see a ~25% speedup with 4 cores
and  an
                        "optimal" speedup
                        of ~2x with 10-12 cores for a VCF with 500k
rows  using
                        a very naive
                        parallelization strategy and no other changes. I
suspect
                        this could be
                        improved on quite a bit, or possibly made
irrelevant
                        with judicious use
                        of serial C code.

                        Did you and Martin make any plans regarding
optimizing
                        writeVcf?

                        Best
                        ~G

                        On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
                        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                        <mailto:vobencha at fhcrc.org <mailto:
vobencha at fhcrc.org>>>

                        wrote:

                              Hi Michael,

                              I'm interested in working on this. I'll
discuss
                        with Martin next
                              week when we're both back in the office.

                              Val

                              On 08/05/14 07:46, Michael Lawrence wrote:

                                  Hi guys (Val, Martin, Herve):

                                  Anyone have an itch for optimization?
The
                        writeVcf function is
                                  currently a
                                  bottleneck in our WGS genotyping
pipeline. For
                        a typical 50
                                  million row
                                  gVCF, it was taking 2.25 hours prior to
                        yesterday's improvements
                                  (pasteCollapseRows) that brought it
down to
                        about 1 hour, which
                                  is still
                                  too long by my standards (> 0). Only
takes 3
                        minutes to call the
                                  genotypes
                                  (and associated likelihoods etc) from
the
                        variant calls (using
                                  80 cores and
                                  450 GB RAM on one node), so the output
is an
                        issue. Profiling
                                  suggests that
                                  the running time scales non-linearly
in the
                        number of rows.

                                  Digging a little deeper, it seems to be
                        something with R's
                                  string/memory
                                  allocation. Below, pasting 1 million
strings
                        takes 6 seconds, but
                        10
                                  million strings takes over 2 minutes.
It gets
                        way worse with 50
                                  million. I
                                  suspect it has something to do with
R's string
                        hash table.

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                      6.396   0.028   6.420

                                  end <- sample(1e8, 1e7)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                  134.714   0.352 134.978

                                  Indeed, even this takes a long time
(in a
                        fresh session):

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  end <- sample(1e8, 1e7)
                                  system.time(as.character(end))
                                       user  system elapsed
                                     57.224   0.156  57.366

                                  But running it a second time is faster
(about
                        what one would
                                  expect?):

                                  system.time(levels <-
as.character(end))
                                       user  system elapsed
                                     23.582   0.021  23.589

                                  I did some simple profiling of R to
find that
                        the resizing of
                                  the string
                                  hash table is not a significant
component of
                        the time. So maybe
                                  something
                                  to do with the R heap/gc? No time
right now to
                        go deeper. But I
                                  know Martin
                                  likes this sort of thing ;)

                                  Michael

                                           [[alternative HTML version
deleted]]

                          ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>>
                                  mailing list

                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
listinfo/bioc-devel>>

                              ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>> mailing
                        list
                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
listinfo/bioc-devel>>

                        --
                        Computational Biologist
                        Genentech Research

                         [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.
org>
                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

            --
            Computational Biology / Fred Hutchinson Cancer Research
Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

            Location: Arnold Building M1 B861
            Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

    --
    Computational Biologist
    Genentech Research

--
Computational Biologist
Genentech Research

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Computational Biologist
Genentech Research

	[[alternative HTML version deleted]]
Thanks Gabe. I should have something for you on Monday.

Val
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:

    Yes, it's very clear that the scaling is non-linear, and Gabe has
    been experimenting with a chunk-wise + parallel algorithm.
    Unfortunately there is some frustrating overhead with the
    parallelism. But I'm glad Val is arriving at something quicker.

    Michael

    On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org
    <mailto:mtmorgan at fhcrc.org>> wrote:

        On 08/27/2014 11:56 AM, Gabe Becker wrote:

            The profiling I attached in my previous email is for 24 geno
            fields, as I said,
            but our typical usecase involves only ~4-6 fields, and is
            faster but still on
            the order of dozens of minutes.

        I think Val is arriving at a (much) more efficient
        implementation, but...

        I wanted to share my guess that the poor _scaling_ is because
        the garbage collector runs multiple times as the different
        strings are pasted together, and has to traverse, in linear
        time, increasing numbers of allocated SEXPs. So times scale
        approximately quadratically with the number of rows in the VCF

        An efficiency is to reduce the number of SEXPs in play by
        writing out in chunks -- as each chunk is written, the SEXPs
        become available for collection and are re-used. Here's my toy
        example

        time.R
        ======
        splitIndices <- function (nx, ncl)
        {
             i <- seq_len(nx)
             if (ncl == 0L)
                 list()
             else if (ncl == 1L || nx == 1L)
                 list(i)
             else {
                 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
                 structure(split(i, cut(i, breaks, labels=FALSE)), names
        = NULL)
             }
        }

        x = as.character(seq_len(1e7)); y = sample(x)
        if (!is.na <http://is.na>(Sys.getenv("SPLIT", NA))) {
             idx <- splitIndices(length(x), 20)
             system.time(for (i in idx) paste(x[i], y[i], sep=":"))
        } else {
             system.time(paste(x, y, sep=":"))
        }

        running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
        time.R the relevant time is

            user  system elapsed
          15.320   0.064  15.381

        versus with $ R --no-save --quiet -f time.R it is

            user  system elapsed
          95.360   0.164  95.511

        I think this is likely an overall strategy when dealing with
        character data -- processing in independent chunks of moderate
        (1M?) size (enabling as a consequence parallel evaluation in
        modest memory) that are sufficient to benefit from
        vectorization, but that do not entail allocation of large
        numbers of in-use SEXPs.

        Martin

            Sorry for the confusion.
            ~G

            On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
            <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>> wrote:

                 Martin and Val.

                 I re-ran writeVcf on our (G)VCF data (34790518 ranges,
            24 geno fields) with
                 profiling enabled. The results of summaryRprof for that
            run are attached,
                 though for a variety of reasons they are pretty misleading.

                 It took over an hour to write (3700+seconds), so it's
            definitely a
                 bottleneck when the data get very large, even if it
            isn't for smaller data.

                 Michael and I both think the culprit is all the pasting
            and cbinding that is
                 going on, and more to the point, that memory for an
            internal representation
                 to be written out is allocated at all.  Streaming
            across the object, looping
                 by rows and writing directly to file (e.g. from C)
            should be blisteringly
                 fast in comparison.

                 ~G

                 On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
            <michafla at gene.com <mailto:michafla at gene.com>
                 <mailto:michafla at gene.com <mailto:michafla at gene.com>>>
            wrote:

                     Gabe is still testing/profiling, but we'll send
            something randomized
                     along eventually.

                     On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                         I didn't see in the original thread a
            reproducible (simulated, I
                         guess) example, to be explicit about what the
            problem is??

                         Martin

                         On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                             My understanding is that the heap
            optimization provided marginal
                             gains, and
                             that we need to think harder about how to
            optimize the all of
                             the string
                             manipulation in writeVcf. We either need to
            reduce it or reduce its
                             overhead (i.e., the CHARSXP allocation).
            Gabe is doing more tests.

                             On Tue, Aug 26, 2014 at 9:43 AM, Valerie
            Obenchain
                             <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

                             wrote:

                                 Hi Gabe,

                                 Martin responded, and so did Michael,

            https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
            <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>

                                 It sounded like Michael was ok with
            working with/around heap
                                 initialization.

                                 Michael, is that right or should we
            still consider this on
                                 the table?

                                 Val

                                 On 08/26/2014 09:34 AM, Gabe Becker wrote:

                                     Val,

                                     Has there been any movement on
            this? This remains a
                                     substantial
                                     bottleneck for us when writing very
            large VCF files (e.g.
                                     variants+genotypes for whole genome
            NGS samples).

                                     I was able to see a ~25% speedup
            with 4 cores and  an
                                     "optimal" speedup
                                     of ~2x with 10-12 cores for a VCF
            with 500k rows  using
                                     a very naive
                                     parallelization strategy and no
            other changes. I suspect
                                     this could be
                                     improved on quite a bit, or
            possibly made irrelevant
                                     with judicious use
                                     of serial C code.

                                     Did you and Martin make any plans
            regarding optimizing
                                     writeVcf?

                                     Best
                                     ~G

                                     On Tue, Aug 5, 2014 at 2:33 PM,
            Valerie Obenchain
                                     <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>
                                     <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                     wrote:

                                           Hi Michael,

                                           I'm interested in working on
            this. I'll discuss
                                     with Martin next
                                           week when we're both back in
            the office.

                                           Val

                                           On 08/05/14 07:46, Michael
            Lawrence wrote:

                                               Hi guys (Val, Martin, Herve):

                                               Anyone have an itch for
            optimization? The
                                     writeVcf function is
                                               currently a
                                               bottleneck in our WGS
            genotyping pipeline. For
                                     a typical 50
                                               million row
                                               gVCF, it was taking 2.25
            hours prior to
                                     yesterday's improvements
                                               (pasteCollapseRows) that
            brought it down to
                                     about 1 hour, which
                                               is still
                                               too long by my standards
            (> 0). Only takes 3
                                     minutes to call the
                                               genotypes
                                               (and associated
            likelihoods etc) from the
                                     variant calls (using
                                               80 cores and
                                               450 GB RAM on one node),
            so the output is an
                                     issue. Profiling
                                               suggests that
                                               the running time scales
            non-linearly in the
                                     number of rows.

                                               Digging a little deeper,
            it seems to be
                                     something with R's
                                               string/memory
                                               allocation. Below,
            pasting 1 million strings
                                     takes 6 seconds, but
                                     10
                                               million strings takes
            over 2 minutes. It gets
                                     way worse with 50
                                               million. I
                                               suspect it has something
            to do with R's string
                                     hash table.

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                                   6.396   0.028   6.420

                                               end <- sample(1e8, 1e7)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                               134.714   0.352 134.978

                                               Indeed, even this takes a
            long time (in a
                                     fresh session):

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               end <- sample(1e8, 1e7)

            system.time(as.character(end))
                                                    user  system elapsed
                                                  57.224   0.156  57.366

                                               But running it a second
            time is faster (about
                                     what one would
                                               expect?):

                                               system.time(levels <-
            as.character(end))
                                                    user  system elapsed
                                                  23.582   0.021  23.589

                                               I did some simple
            profiling of R to find that
                                     the resizing of
                                               the string
                                               hash table is not a
            significant component of
                                     the time. So maybe
                                               something
                                               to do with the R heap/gc?
            No time right now to
                                     go deeper. But I
                                               know Martin
                                               likes this sort of thing ;)

                                               Michael

                                                        [[alternative
            HTML version deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                               mailing list

            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>> mailing
                                     list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                     --
                                     Computational Biologist
                                     Genentech Research

                                      [[alternative HTML version deleted]]

            ___________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                             mailing list
            https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                         --
                         Computational Biology / Fred Hutchinson Cancer
            Research Center
                         1100 Fairview Ave. N.
                         PO Box 19024 Seattle, WA 98109

                         Location: Arnold Building M1 B861
                         Phone: (206) 667-2793
            <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>

                 --
                 Computational Biologist
                 Genentech Research

            --
            Computational Biologist
            Genentech Research

        --
        Computational Biology / Fred Hutchinson Cancer Research Center
        1100 Fairview Ave. N.
        PO Box 19024 Seattle, WA 98109

        Location: Arnold Building M1 B861
        Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

--
Computational Biologist
Genentech Research
This approach, writing in chunks, is the same Herve and I used for writing
FASTA in the Biostrings package, although I see that Herve has now replaced
the R implementation with a C implementation.  I similarly found an
absolutely huge speed up when writing genomes, by chunking.

Best,
Kasper

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno fields, as I
said,
but our typical usecase involves only ~4-6 fields, and is faster but
still on
the order of dozens of minutes.

I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage
collector runs multiple times as the different strings are pasted together,
and has to traverse, in linear time, increasing numbers of allocated SEXPs.
So times scale approximately quadratically with the number of rows in the
VCF

An efficiency is to reduce the number of SEXPs in play by writing out in
chunks -- as each chunk is written, the SEXPs become available for
collection and are re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
    i <- seq_len(nx)
    if (ncl == 0L)
        list()
    else if (ncl == 1L || nx == 1L)
        list(i)
    else {
        fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
    idx <- splitIndices(length(x), 20)
    system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
    system.time(paste(x, y, sep=":"))
}

running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the
relevant time is

   user  system elapsed
 15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

   user  system elapsed
 95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character
data -- processing in independent chunks of moderate (1M?) size (enabling
as a consequence parallel evaluation in modest memory) that are sufficient
to benefit from vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin

Sorry for the confusion.
~G

On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <beckerg4 at gene.com
<mailto:beckerg4 at gene.com>> wrote:

    Martin and Val.

    I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno
fields) with
    profiling enabled. The results of summaryRprof for that run are
attached,
    though for a variety of reasons they are pretty misleading.

    It took over an hour to write (3700+seconds), so it's definitely a
    bottleneck when the data get very large, even if it isn't for smaller
data.

    Michael and I both think the culprit is all the pasting and cbinding
that is
    going on, and more to the point, that memory for an internal
representation
    to be written out is allocated at all.  Streaming across the object,
looping
    by rows and writing directly to file (e.g. from C) should be
blisteringly
    fast in comparison.

    ~G

    On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence <michafla at gene.com
    <mailto:michafla at gene.com>> wrote:

        Gabe is still testing/profiling, but we'll send something
randomized
        along eventually.

        On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <
mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>> wrote:

            I didn't see in the original thread a reproducible
(simulated, I
            guess) example, to be explicit about what the problem is??

            Martin

            On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                My understanding is that the heap optimization provided
marginal
                gains, and
                that we need to think harder about how to optimize the
all of
                the string
                manipulation in writeVcf. We either need to reduce it or
reduce its
                overhead (i.e., the CHARSXP allocation). Gabe is doing
more tests.

                On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain
                <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                wrote:

                    Hi Gabe,

                    Martin responded, and so did Michael,

                    https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html

                    <https://stat.ethz.ch/pipermail/bioc-devel/2014-
August/006082.html>

                    It sounded like Michael was ok with working
with/around heap
                    initialization.

                    Michael, is that right or should we still consider
this on
                    the table?

                    Val

                    On 08/26/2014 09:34 AM, Gabe Becker wrote:

                        Val,

                        Has there been any movement on this? This remains
a
                        substantial
                        bottleneck for us when writing very large VCF
files (e.g.
                        variants+genotypes for whole genome NGS samples).

                        I was able to see a ~25% speedup with 4 cores
and  an
                        "optimal" speedup
                        of ~2x with 10-12 cores for a VCF with 500k rows
using
                        a very naive
                        parallelization strategy and no other changes. I
suspect
                        this could be
                        improved on quite a bit, or possibly made
irrelevant
                        with judicious use
                        of serial C code.

                        Did you and Martin make any plans regarding
optimizing
                        writeVcf?

                        Best
                        ~G

                        On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain
                        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                        <mailto:vobencha at fhcrc.org <mailto:
vobencha at fhcrc.org>>>

                        wrote:

                              Hi Michael,

                              I'm interested in working on this. I'll
discuss
                        with Martin next
                              week when we're both back in the office.

                              Val

                              On 08/05/14 07:46, Michael Lawrence wrote:

                                  Hi guys (Val, Martin, Herve):

                                  Anyone have an itch for optimization?
The
                        writeVcf function is
                                  currently a
                                  bottleneck in our WGS genotyping
pipeline. For
                        a typical 50
                                  million row
                                  gVCF, it was taking 2.25 hours prior to
                        yesterday's improvements
                                  (pasteCollapseRows) that brought it
down to
                        about 1 hour, which
                                  is still
                                  too long by my standards (> 0). Only
takes 3
                        minutes to call the
                                  genotypes
                                  (and associated likelihoods etc) from
the
                        variant calls (using
                                  80 cores and
                                  450 GB RAM on one node), so the output
is an
                        issue. Profiling
                                  suggests that
                                  the running time scales non-linearly in
the
                        number of rows.

                                  Digging a little deeper, it seems to be
                        something with R's
                                  string/memory
                                  allocation. Below, pasting 1 million
strings
                        takes 6 seconds, but
                        10
                                  million strings takes over 2 minutes.
It gets
                        way worse with 50
                                  million. I
                                  suspect it has something to do with R's
string
                        hash table.

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                      6.396   0.028   6.420

                                  end <- sample(1e8, 1e7)
                                  system.time(paste0("END", "=", end))
                                       user  system elapsed
                                  134.714   0.352 134.978

                                  Indeed, even this takes a long time (in
a
                        fresh session):

                                  set.seed(1000)
                                  end <- sample(1e8, 1e6)
                                  end <- sample(1e8, 1e7)
                                  system.time(as.character(end))
                                       user  system elapsed
                                     57.224   0.156  57.366

                                  But running it a second time is faster
(about
                        what one would
                                  expect?):

                                  system.time(levels <- as.character(end))
                                       user  system elapsed
                                     23.582   0.021  23.589

                                  I did some simple profiling of R to
find that
                        the resizing of
                                  the string
                                  hash table is not a significant
component of
                        the time. So maybe
                                  something
                                  to do with the R heap/gc? No time right
now to
                        go deeper. But I
                                  know Martin
                                  likes this sort of thing ;)

                                  Michael

                                           [[alternative HTML version
deleted]]

                          ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>>
                                  mailing list
                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel

                              ______________________________
_____________________
                        Bioc-devel at r-project.org
                        <mailto:Bioc-devel at r-project.org>
                        <mailto:Bioc-devel at r-project.__org
                        <mailto:Bioc-devel at r-project.org>> mailing
                        list
                        https://stat.ethz.ch/mailman/_
___listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel>

                          <https://stat.ethz.ch/mailman/
__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel

                        --
                        Computational Biologist
                        Genentech Research

                         [[alternative HTML version deleted]]

                _________________________________________________
                Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org

                mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

            --
            Computational Biology / Fred Hutchinson Cancer Research Center
            1100 Fairview Ave. N.
            PO Box 19024 Seattle, WA 98109

            Location: Arnold Building M1 B861
            Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

    --
    Computational Biologist
    Genentech Research

--
Computational Biologist
Genentech Research

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now 
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
     user   system  elapsed
  971.032    6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C, however, it looks like geno 
was the bottleneck.

I've tested a number of samples/fields combinations in files with >= .5 
million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie
Thanks Gabe. I should have something for you on Monday.

Val

On 09/04/2014 01:56 PM, Gabe Becker wrote:
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:

    Yes, it's very clear that the scaling is non-linear, and Gabe has
    been experimenting with a chunk-wise + parallel algorithm.
    Unfortunately there is some frustrating overhead with the
    parallelism. But I'm glad Val is arriving at something quicker.

    Michael

    On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org
    <mailto:mtmorgan at fhcrc.org>> wrote:

        On 08/27/2014 11:56 AM, Gabe Becker wrote:

            The profiling I attached in my previous email is for 24 geno
            fields, as I said,
            but our typical usecase involves only ~4-6 fields, and is
            faster but still on
            the order of dozens of minutes.

        I think Val is arriving at a (much) more efficient
        implementation, but...

        I wanted to share my guess that the poor _scaling_ is because
        the garbage collector runs multiple times as the different
        strings are pasted together, and has to traverse, in linear
        time, increasing numbers of allocated SEXPs. So times scale
        approximately quadratically with the number of rows in the VCF

        An efficiency is to reduce the number of SEXPs in play by
        writing out in chunks -- as each chunk is written, the SEXPs
        become available for collection and are re-used. Here's my toy
        example

        time.R
        ======
        splitIndices <- function (nx, ncl)
        {
             i <- seq_len(nx)
             if (ncl == 0L)
                 list()
             else if (ncl == 1L || nx == 1L)
                 list(i)
             else {
                 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
                 structure(split(i, cut(i, breaks, labels=FALSE)), names
        = NULL)
             }
        }

        x = as.character(seq_len(1e7)); y = sample(x)
        if (!is.na <http://is.na>(Sys.getenv("SPLIT", NA))) {
             idx <- splitIndices(length(x), 20)
             system.time(for (i in idx) paste(x[i], y[i], sep=":"))
        } else {
             system.time(paste(x, y, sep=":"))
        }

        running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
        time.R the relevant time is

            user  system elapsed
          15.320   0.064  15.381

        versus with $ R --no-save --quiet -f time.R it is

            user  system elapsed
          95.360   0.164  95.511

        I think this is likely an overall strategy when dealing with
        character data -- processing in independent chunks of moderate
        (1M?) size (enabling as a consequence parallel evaluation in
        modest memory) that are sufficient to benefit from
        vectorization, but that do not entail allocation of large
        numbers of in-use SEXPs.

        Martin

            Sorry for the confusion.
            ~G

            On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
            <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>> wrote:

                 Martin and Val.

                 I re-ran writeVcf on our (G)VCF data (34790518 ranges,
            24 geno fields) with
                 profiling enabled. The results of summaryRprof for that
            run are attached,
                 though for a variety of reasons they are pretty
misleading.

                 It took over an hour to write (3700+seconds), so it's
            definitely a
                 bottleneck when the data get very large, even if it
            isn't for smaller data.

                 Michael and I both think the culprit is all the pasting
            and cbinding that is
                 going on, and more to the point, that memory for an
            internal representation
                 to be written out is allocated at all.  Streaming
            across the object, looping
                 by rows and writing directly to file (e.g. from C)
            should be blisteringly
                 fast in comparison.

                 ~G

                 On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
            <michafla at gene.com <mailto:michafla at gene.com>
                 <mailto:michafla at gene.com <mailto:michafla at gene.com>>>
            wrote:

                     Gabe is still testing/profiling, but we'll send
            something randomized
                     along eventually.

                     On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                         I didn't see in the original thread a
            reproducible (simulated, I
                         guess) example, to be explicit about what the
            problem is??

                         Martin

                         On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                             My understanding is that the heap
            optimization provided marginal
                             gains, and
                             that we need to think harder about how to
            optimize the all of
                             the string
                             manipulation in writeVcf. We either need to
            reduce it or reduce its
                             overhead (i.e., the CHARSXP allocation).
            Gabe is doing more tests.

                             On Tue, Aug 26, 2014 at 9:43 AM, Valerie
            Obenchain
                             <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

                             wrote:

                                 Hi Gabe,

                                 Martin responded, and so did Michael,

https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>

                                 It sounded like Michael was ok with
            working with/around heap
                                 initialization.

                                 Michael, is that right or should we
            still consider this on
                                 the table?

                                 Val

                                 On 08/26/2014 09:34 AM, Gabe Becker
wrote:

                                     Val,

                                     Has there been any movement on
            this? This remains a
                                     substantial
                                     bottleneck for us when writing very
            large VCF files (e.g.
                                     variants+genotypes for whole genome
            NGS samples).

                                     I was able to see a ~25% speedup
            with 4 cores and  an
                                     "optimal" speedup
                                     of ~2x with 10-12 cores for a VCF
            with 500k rows  using
                                     a very naive
                                     parallelization strategy and no
            other changes. I suspect
                                     this could be
                                     improved on quite a bit, or
            possibly made irrelevant
                                     with judicious use
                                     of serial C code.

                                     Did you and Martin make any plans
            regarding optimizing
                                     writeVcf?

                                     Best
                                     ~G

                                     On Tue, Aug 5, 2014 at 2:33 PM,
            Valerie Obenchain
                                     <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>
                                     <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                     wrote:

                                           Hi Michael,

                                           I'm interested in working on
            this. I'll discuss
                                     with Martin next
                                           week when we're both back in
            the office.

                                           Val

                                           On 08/05/14 07:46, Michael
            Lawrence wrote:

                                               Hi guys (Val, Martin,
Herve):

                                               Anyone have an itch for
            optimization? The
                                     writeVcf function is
                                               currently a
                                               bottleneck in our WGS
            genotyping pipeline. For
                                     a typical 50
                                               million row
                                               gVCF, it was taking 2.25
            hours prior to
                                     yesterday's improvements
                                               (pasteCollapseRows) that
            brought it down to
                                     about 1 hour, which
                                               is still
                                               too long by my standards
            (> 0). Only takes 3
                                     minutes to call the
                                               genotypes
                                               (and associated
            likelihoods etc) from the
                                     variant calls (using
                                               80 cores and
                                               450 GB RAM on one node),
            so the output is an
                                     issue. Profiling
                                               suggests that
                                               the running time scales
            non-linearly in the
                                     number of rows.

                                               Digging a little deeper,
            it seems to be
                                     something with R's
                                               string/memory
                                               allocation. Below,
            pasting 1 million strings
                                     takes 6 seconds, but
                                     10
                                               million strings takes
            over 2 minutes. It gets
                                     way worse with 50
                                               million. I
                                               suspect it has something
            to do with R's string
                                     hash table.

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                                   6.396   0.028   6.420

                                               end <- sample(1e8, 1e7)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                               134.714   0.352 134.978

                                               Indeed, even this takes a
            long time (in a
                                     fresh session):

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               end <- sample(1e8, 1e7)

            system.time(as.character(end))
                                                    user  system elapsed
                                                  57.224   0.156  57.366

                                               But running it a second
            time is faster (about
                                     what one would
                                               expect?):

                                               system.time(levels <-
            as.character(end))
                                                    user  system elapsed
                                                  23.582   0.021  23.589

                                               I did some simple
            profiling of R to find that
                                     the resizing of
                                               the string
                                               hash table is not a
            significant component of
                                     the time. So maybe
                                               something
                                               to do with the R heap/gc?
            No time right now to
                                     go deeper. But I
                                               know Martin
                                               likes this sort of
thing ;)

                                               Michael

                                                        [[alternative
            HTML version deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                               mailing list

            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>> mailing
                                     list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                     --
                                     Computational Biologist
                                     Genentech Research

                                      [[alternative HTML version
deleted]]

            ___________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                             mailing list
            https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                         --
                         Computational Biology / Fred Hutchinson Cancer
            Research Center
                         1100 Fairview Ave. N.
                         PO Box 19024 Seattle, WA 98109

                         Location: Arnold Building M1 B861
                         Phone: (206) 667-2793
            <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>

                 --
                 Computational Biologist
                 Genentech Research

            --
            Computational Biologist
            Genentech Research

        --
        Computational Biology / Fred Hutchinson Cancer Research Center
        1100 Fairview Ave. N.
        PO Box 19024 Seattle, WA 98109

        Location: Arnold Building M1 B861
        Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:
The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now takes
about 17 minutes.

hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

param = ScanVcfParam(info=NA)
vcf = readVcf(fl, "", param=param)
dim(vcf)
[1] 51612762        1

system.time(writeVcf(vcf, "out.vcf"))
    user   system  elapsed
 971.032    6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed
things up enough we were planning to implement 'chunking' through the VCF
and/or move the parsing of info to C, however, it looks like geno was the
bottleneck.

I've tested a number of samples/fields combinations in files with >= .5
million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie

On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val

On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G

On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:

    Yes, it's very clear that the scaling is non-linear, and Gabe has
    been experimenting with a chunk-wise + parallel algorithm.
    Unfortunately there is some frustrating overhead with the
    parallelism. But I'm glad Val is arriving at something quicker.

    Michael

    On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org
    <mailto:mtmorgan at fhcrc.org>> wrote:

        On 08/27/2014 11:56 AM, Gabe Becker wrote:

            The profiling I attached in my previous email is for 24 geno
            fields, as I said,
            but our typical usecase involves only ~4-6 fields, and is
            faster but still on
            the order of dozens of minutes.

        I think Val is arriving at a (much) more efficient
        implementation, but...

        I wanted to share my guess that the poor _scaling_ is because
        the garbage collector runs multiple times as the different
        strings are pasted together, and has to traverse, in linear
        time, increasing numbers of allocated SEXPs. So times scale
        approximately quadratically with the number of rows in the VCF

        An efficiency is to reduce the number of SEXPs in play by
        writing out in chunks -- as each chunk is written, the SEXPs
        become available for collection and are re-used. Here's my toy
        example

        time.R
        ======
        splitIndices <- function (nx, ncl)
        {
             i <- seq_len(nx)
             if (ncl == 0L)
                 list()
             else if (ncl == 1L || nx == 1L)
                 list(i)
             else {
                 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
                 structure(split(i, cut(i, breaks, labels=FALSE)), names
        = NULL)
             }
        }

        x = as.character(seq_len(1e7)); y = sample(x)
        if (!is.na <http://is.na>(Sys.getenv("SPLIT", NA))) {
             idx <- splitIndices(length(x), 20)
             system.time(for (i in idx) paste(x[i], y[i], sep=":"))
        } else {
             system.time(paste(x, y, sep=":"))
        }

        running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
        time.R the relevant time is

            user  system elapsed
          15.320   0.064  15.381

        versus with $ R --no-save --quiet -f time.R it is

            user  system elapsed
          95.360   0.164  95.511

        I think this is likely an overall strategy when dealing with
        character data -- processing in independent chunks of moderate
        (1M?) size (enabling as a consequence parallel evaluation in
        modest memory) that are sufficient to benefit from
        vectorization, but that do not entail allocation of large
        numbers of in-use SEXPs.

        Martin

            Sorry for the confusion.
            ~G

            On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
            <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>>
wrote:

                 Martin and Val.

                 I re-ran writeVcf on our (G)VCF data (34790518 ranges,
            24 geno fields) with
                 profiling enabled. The results of summaryRprof for that
            run are attached,
                 though for a variety of reasons they are pretty
misleading.

                 It took over an hour to write (3700+seconds), so it's
            definitely a
                 bottleneck when the data get very large, even if it
            isn't for smaller data.

                 Michael and I both think the culprit is all the pasting
            and cbinding that is
                 going on, and more to the point, that memory for an
            internal representation
                 to be written out is allocated at all.  Streaming
            across the object, looping
                 by rows and writing directly to file (e.g. from C)
            should be blisteringly
                 fast in comparison.

                 ~G

                 On Tue, Aug 26, 2014 at 11:57 AM, Michael Lawrence
            <michafla at gene.com <mailto:michafla at gene.com>
                 <mailto:michafla at gene.com <mailto:michafla at gene.com>>>
            wrote:

                     Gabe is still testing/profiling, but we'll send
            something randomized
                     along eventually.

                     On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                         I didn't see in the original thread a
            reproducible (simulated, I
                         guess) example, to be explicit about what the
            problem is??

                         Martin

                         On 08/26/2014 10:47 AM, Michael Lawrence wrote:

                             My understanding is that the heap
            optimization provided marginal
                             gains, and
                             that we need to think harder about how to
            optimize the all of
                             the string
                             manipulation in writeVcf. We either need to
            reduce it or reduce its
                             overhead (i.e., the CHARSXP allocation).
            Gabe is doing more tests.

                             On Tue, Aug 26, 2014 at 9:43 AM, Valerie
            Obenchain
                             <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

                             wrote:

                                 Hi Gabe,

                                 Martin responded, and so did Michael,

https://stat.ethz.ch/____pipermail/bioc-devel/2014-____
August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>

                                 It sounded like Michael was ok with
            working with/around heap
                                 initialization.

                                 Michael, is that right or should we
            still consider this on
                                 the table?

                                 Val

                                 On 08/26/2014 09:34 AM, Gabe Becker
wrote:

                                     Val,

                                     Has there been any movement on
            this? This remains a
                                     substantial
                                     bottleneck for us when writing very
            large VCF files (e.g.
                                     variants+genotypes for whole genome
            NGS samples).

                                     I was able to see a ~25% speedup
            with 4 cores and  an
                                     "optimal" speedup
                                     of ~2x with 10-12 cores for a VCF
            with 500k rows  using
                                     a very naive
                                     parallelization strategy and no
            other changes. I suspect
                                     this could be
                                     improved on quite a bit, or
            possibly made irrelevant
                                     with judicious use
                                     of serial C code.

                                     Did you and Martin make any plans
            regarding optimizing
                                     writeVcf?

                                     Best
                                     ~G

                                     On Tue, Aug 5, 2014 at 2:33 PM,
            Valerie Obenchain
                                     <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>
                                     <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                     wrote:

                                           Hi Michael,

                                           I'm interested in working on
            this. I'll discuss
                                     with Martin next
                                           week when we're both back in
            the office.

                                           Val

                                           On 08/05/14 07:46, Michael
            Lawrence wrote:

                                               Hi guys (Val, Martin,
Herve):

                                               Anyone have an itch for
            optimization? The
                                     writeVcf function is
                                               currently a
                                               bottleneck in our WGS
            genotyping pipeline. For
                                     a typical 50
                                               million row
                                               gVCF, it was taking 2.25
            hours prior to
                                     yesterday's improvements
                                               (pasteCollapseRows) that
            brought it down to
                                     about 1 hour, which
                                               is still
                                               too long by my standards
            (> 0). Only takes 3
                                     minutes to call the
                                               genotypes
                                               (and associated
            likelihoods etc) from the
                                     variant calls (using
                                               80 cores and
                                               450 GB RAM on one node),
            so the output is an
                                     issue. Profiling
                                               suggests that
                                               the running time scales
            non-linearly in the
                                     number of rows.

                                               Digging a little deeper,
            it seems to be
                                     something with R's
                                               string/memory
                                               allocation. Below,
            pasting 1 million strings
                                     takes 6 seconds, but
                                     10
                                               million strings takes
            over 2 minutes. It gets
                                     way worse with 50
                                               million. I
                                               suspect it has something
            to do with R's string
                                     hash table.

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                                   6.396   0.028   6.420

                                               end <- sample(1e8, 1e7)
                                               system.time(paste0("END",
            "=", end))
                                                    user  system elapsed
                                               134.714   0.352 134.978

                                               Indeed, even this takes a
            long time (in a
                                     fresh session):

                                               set.seed(1000)
                                               end <- sample(1e8, 1e6)
                                               end <- sample(1e8, 1e7)

            system.time(as.character(end))
                                                    user  system elapsed
                                                  57.224   0.156  57.366

                                               But running it a second
            time is faster (about
                                     what one would
                                               expect?):

                                               system.time(levels <-
            as.character(end))
                                                    user  system elapsed
                                                  23.582   0.021  23.589

                                               I did some simple
            profiling of R to find that
                                     the resizing of
                                               the string
                                               hash table is not a
            significant component of
                                     the time. So maybe
                                               something
                                               to do with the R heap/gc?
            No time right now to
                                     go deeper. But I
                                               know Martin
                                               likes this sort of
thing ;)

                                               Michael

                                                        [[alternative
            HTML version deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                               mailing list

            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                                     <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                                     <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>> mailing
                                     list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                     --
                                     Computational Biologist
                                     Genentech Research

                                      [[alternative HTML version
deleted]]

            ___________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                             mailing list
            https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

                         --
                         Computational Biology / Fred Hutchinson Cancer
            Research Center
                         1100 Fairview Ave. N.
                         PO Box 19024 Seattle, WA 98109

                         Location: Arnold Building M1 B861
                         Phone: (206) 667-2793
            <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>

                 --
                 Computational Biologist
                 Genentech Research

            --
            Computational Biologist
            Genentech Research

        --
        Computational Biology / Fred Hutchinson Cancer Research Center
        1100 Fairview Ave. N.
        PO Box 19024 Seattle, WA 98109

        Location: Arnold Building M1 B861
        Phone: (206) 667-2793 <tel:%28206%29%20667-2793>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Computational Biologist
Genentech Research

	[[alternative HTML version deleted]]
fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...

Val
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >=
    .5 million rows and the improvement over writeVcf() in release is ~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

            ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
            <ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
            took about ~4.2 hrs to write out, and is about 1.5x the size
            of the
            files we are actually dealing with (~50M ranges vs our ~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel algorithm.
                 Unfortunately there is some frustrating overhead with the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R --no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file (e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

            https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
            <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5, 2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it was
            taking 2.25
                         hours prior to
                                                  yesterday's improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's
                                                            string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed, even
            this takes a
                         long time (in a
                                                  fresh session):

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the 
process of moving SimpleList and DataFrame from IRanges to S4Vectors; 
finished up today I think. Anyhow, if you get VariantAnnotation from svn 
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe 
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still 
gain from adding that.

Valerie
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val

On 09/08/2014 08:43 AM, Gabe Becker wrote:
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >=
    .5 million rows and the improvement over writeVcf() in release is
~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz

<ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
            took about ~4.2 hrs to write out, and is about 1.5x the size
            of the
            files we are actually dealing with (~50M ranges vs our ~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel
algorithm.
                 Unfortunately there is some frustrating overhead with
the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R --no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file (e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original
thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the
heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5, 2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it was
            taking 2.25
                         hours prior to
                                                  yesterday's
improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and
associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's
                                                            string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed, even
            this takes a
                         long time (in a
                                                  fresh session):

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi Val,
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.
I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of an unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : 
FALSE

Hard for me to tell whether this is related to DataFrame moving from 
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.
Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val

On 09/08/2014 08:43 AM, Gabe Becker wrote:
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >=
    .5 million rows and the improvement over writeVcf() in release is
~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz

<ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
            took about ~4.2 hrs to write out, and is about 1.5x the size
            of the
            files we are actually dealing with (~50M ranges vs our
~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel
algorithm.
                 Unfortunately there is some frustrating overhead with
the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R --no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file (e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original
thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the
heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5, 2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it was
            taking 2.25
                         hours prior to
                                                  yesterday's
improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and
associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests
that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's

string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed, even
            this takes a
                         long time (in a
                                                  fresh session):

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational
Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was 
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.

Val
Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.
I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of an unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) :
FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.

Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val

On 09/08/2014 08:43 AM, Gabe Becker wrote:
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing
now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files
with >=
    .5 million rows and the improvement over writeVcf() in release is
~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files
make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz

<ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>

            took about ~4.2 hrs to write out, and is about 1.5x the
size
            of the
            files we are actually dealing with (~50M ranges vs our
~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel
algorithm.
                 Unfortunately there is some frustrating overhead with
the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of
rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R
--no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail
allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file
(e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org
<mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original
thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the
heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us
when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes
for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5,
2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it
was
            taking 2.25
                         hours prior to
                                                  yesterday's
improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and
associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores
and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests
that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's

string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed,
even
            this takes a
                         long time (in a
                                                  fresh session):

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing
list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational
Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fhcrc.org
Phone: (206) 667-3158
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks and sorry for the noise.

H.
Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.

Val

On 09/09/2014 02:47 PM, Herv? Pag?s wrote:
Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.
I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of an unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)

source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) :
FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.

Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val

On 09/08/2014 08:43 AM, Gabe Becker wrote:
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing
now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files
with >=
    .5 million rows and the improvement over writeVcf() in release is
~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files
make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz

<ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>

            took about ~4.2 hrs to write out, and is about 1.5x the
size
            of the
            files we are actually dealing with (~50M ranges vs our
~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel
algorithm.
                 Unfortunately there is some frustrating overhead with
the
                 parallelism. But I'm glad Val is arriving at
something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more
efficient
                     implementation, but...

                     I wanted to share my guess that the poor
_scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of
rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is
written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 *
nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz,
length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R
--no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent
chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit
from
                     vectorization, but that do not entail
allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file
(e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM,
Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org
<mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original
thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the
heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

<https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

<https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html

<https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that
right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us
when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes
for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5,
2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when
we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone
have
            an itch for
                         optimization? The
                                                  writeVcf function is

currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million
row
                                                            gVCF, it
was
            taking 2.25
                         hours prior to
                                                  yesterday's
improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too
long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and
associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores
and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests
that
                                                            the
running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's

string/memory

allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed,
even
            this takes a
                         long time (in a
                                                  fresh session):

set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But
running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know
Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing
list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project
<mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational
Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1
B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the past 
week have shaved off more time. It now takes ~ 9 minutes to write the 
NA12877 example.
dim(vcf)
[1] 51612762        1
gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8
print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
555.282   6.700 565.700
gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
Vcells 1176960717 8979.5 2183277445 16657.1 2171401955 16566.5
In the most recent version (1.11.35) I've added chunking for files with 
 > 1e5 records. Right now the choice of # records per chunk is simple, 
based on total records only. We are still experimenting with this. You 
can override default chunking with 'nchunk'. Examples on the man page.

Valerie
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >=
    .5 million rows and the improvement over writeVcf() in release is ~ 90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

            ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
            <ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>
            took about ~4.2 hrs to write out, and is about 1.5x the size
            of the
            files we are actually dealing with (~50M ranges vs our ~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel algorithm.
                 Unfortunately there is some frustrating overhead with the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {
                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R --no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file (e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original thread a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

            https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
            <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5, 2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it was
            taking 2.25
                         hours prior to
                                                  yesterday's improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's
                                                            string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed, even
            this takes a
                         long time (in a
                                                  fresh session):

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research
Valerie,

Apologies for this taking much longer than it should have. The changes in
Bioc-devel have wreaked havoc on the code we use to to generate and process
the data we need to write out, but the fault is mine for not getting on top
of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5hrs to write the an expanded vcf with  56M rows
(print output and sessionInfo() follow). I'll try reading in the illumina
platinum and writing it back out to see if it is something about our
specific vcf object (could ExpandedVCF vs VCF be an issue?).
vcfgeno
*class: ExpandedVCF *
*dim: 50307989 1 *
rowData(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: END
  Fields with no header: END
geno(vcf):
  SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
          Number Type    Description

   AD     2      Integer Allelic depths (number of reads in each observed
al...
   DP     1      Integer Total read depth

   FT     1      String  Variant filters

   GT     1      String  Genotype

   GQ     1      Integer Genotype quality

   PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes

   MIN_DP 1      Integer Minimum DP observed within the GVCF block
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        *VariantAnnotation_1.11.35*
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5                base64enc_0.1-2
 [3] BatchJobs_1.4                  BBmisc_1.7
 [5] biomaRt_2.21.1                 bitops_1.0-6
 [7] brew_1.0-6                     BSgenome_1.33.9
 [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:
Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the past
week have shaved off more time. It now takes ~ 9 minutes to write the
NA12877 example.

 dim(vcf)

[1] 51612762        1

gc()

             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8

print(system.time(writeVcf(vcf, tempfile())))

   user  system elapsed
555.282   6.700 565.700

gc()

             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
Vcells 1176960717 8979.5 2183277445 16657.1 2171401955 16566.5

In the most recent version (1.11.35) I've added chunking for files with >
1e5 records. Right now the choice of # records per chunk is simple, based
on total records only. We are still experimenting with this. You can
override default chunking with 'nchunk'. Examples on the man page.

Valerie

On 09/08/14 08:43, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    The new writeVcf code is in 1.11.28.

    Using the illumina file you suggested, geno fields only, writing now
    takes about 17 minutes.

     > hdr
    class: VCFHeader
    samples(1): NA12877
    meta(6): fileformat ApplyRecalibration ... reference source
    fixed(1): FILTER
    info(22): AC AF ... culprit set
    geno(8): GT GQX ... PL VF

     > param = ScanVcfParam(info=NA)
     > vcf = readVcf(fl, "", param=param)
     > dim(vcf)
    [1] 51612762        1

     > system.time(writeVcf(vcf, "out.vcf"))
         user   system  elapsed
      971.032    6.568 1004.593

    In 1.11.28, parsing of geno data was moved to C. If this didn't
    speed things up enough we were planning to implement 'chunking'
    through the VCF and/or move the parsing of info to C, however, it
    looks like geno was the bottleneck.

    I've tested a number of samples/fields combinations in files with >=
    .5 million rows and the improvement over writeVcf() in release is ~
90%.

    Valerie

    On 09/04/14 15:28, Valerie Obenchain wrote:

        Thanks Gabe. I should have something for you on Monday.

        Val

        On 09/04/2014 01:56 PM, Gabe Becker wrote:

            Val and Martin,

            Apologies for the delay.

            We realized that the Illumina platinum genome vcf files make
            a good test
            case, assuming you strip out all the info (info=NA when
            reading it into
            R) stuff.

            ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1._
_genome.vcf.gz
            <ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.
genome.vcf.gz>
            took about ~4.2 hrs to write out, and is about 1.5x the size
            of the
            files we are actually dealing with (~50M ranges vs our ~30M).

            Looking forward a new vastly improved writeVcf :).

            ~G

            On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
            <lawrence.michael at gene.com
            <mailto:lawrence.michael at gene.com>
            <mailto:lawrence.michael at gene.__com
            <mailto:lawrence.michael at gene.com>>> wrote:

                 Yes, it's very clear that the scaling is non-linear,
            and Gabe has
                 been experimenting with a chunk-wise + parallel
algorithm.
                 Unfortunately there is some frustrating overhead with the
                 parallelism. But I'm glad Val is arriving at something
            quicker.

                 Michael

                 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
            <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
                 <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>> wrote:

                     On 08/27/2014 11:56 AM, Gabe Becker wrote:

                         The profiling I attached in my previous email
            is for 24 geno
                         fields, as I said,
                         but our typical usecase involves only ~4-6
            fields, and is
                         faster but still on
                         the order of dozens of minutes.

                     I think Val is arriving at a (much) more efficient
                     implementation, but...

                     I wanted to share my guess that the poor _scaling_
            is because
                     the garbage collector runs multiple times as the
            different
                     strings are pasted together, and has to traverse,
            in linear
                     time, increasing numbers of allocated SEXPs. So
            times scale
                     approximately quadratically with the number of rows
            in the VCF

                     An efficiency is to reduce the number of SEXPs in
            play by
                     writing out in chunks -- as each chunk is written,
            the SEXPs
                     become available for collection and are re-used.
            Here's my toy
                     example

                     time.R
                     ======
                     splitIndices <- function (nx, ncl)
                     {
                          i <- seq_len(nx)
                          if (ncl == 0L)
                              list()
                          else if (ncl == 1L || nx == 1L)
                              list(i)
                          else {
                              fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
                              breaks <- seq(1 - fuzz, nx + fuzz, length
            = ncl + 1L)
                              structure(split(i, cut(i, breaks,
            labels=FALSE)), names
                     = NULL)
                          }
                     }

                     x = as.character(seq_len(1e7)); y = sample(x)
                     if (!is.na <http://is.na>
            <http://is.na>(Sys.getenv("__SPLIT", NA))) {

                          idx <- splitIndices(length(x), 20)
                          system.time(for (i in idx) paste(x[i], y[i],
            sep=":"))
                     } else {
                          system.time(paste(x, y, sep=":"))
                     }

                     running under R-devel with $ SPLIT=TRUE R --no-save
            --quiet -f
                     time.R the relevant time is

                         user  system elapsed
                       15.320   0.064  15.381

                     versus with $ R --no-save --quiet -f time.R it is

                         user  system elapsed
                       95.360   0.164  95.511

                     I think this is likely an overall strategy when
            dealing with
                     character data -- processing in independent chunks
            of moderate
                     (1M?) size (enabling as a consequence parallel
            evaluation in
                     modest memory) that are sufficient to benefit from
                     vectorization, but that do not entail allocation of
            large
                     numbers of in-use SEXPs.

                     Martin

                         Sorry for the confusion.
                         ~G

                         On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
                         <beckerg4 at gene.com <mailto:beckerg4 at gene.com>
            <mailto:beckerg4 at gene.com <mailto:beckerg4 at gene.com>>
                         <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
            <mailto:beckerg4 at gene.com>>>> wrote:

                              Martin and Val.

                              I re-ran writeVcf on our (G)VCF data
            (34790518 ranges,
                         24 geno fields) with
                              profiling enabled. The results of
            summaryRprof for that
                         run are attached,
                              though for a variety of reasons they are
            pretty
            misleading.

                              It took over an hour to write
            (3700+seconds), so it's
                         definitely a
                              bottleneck when the data get very large,
            even if it
                         isn't for smaller data.

                              Michael and I both think the culprit is
            all the pasting
                         and cbinding that is
                              going on, and more to the point, that
            memory for an
                         internal representation
                              to be written out is allocated at all.
            Streaming
                         across the object, looping
                              by rows and writing directly to file (e.g.
            from C)
                         should be blisteringly
                              fast in comparison.

                              ~G

                              On Tue, Aug 26, 2014 at 11:57 AM, Michael
            Lawrence
                         <michafla at gene.com <mailto:michafla at gene.com>
            <mailto:michafla at gene.com <mailto:michafla at gene.com>>
                              <mailto:michafla at gene.com
            <mailto:michafla at gene.com> <mailto:michafla at gene.com
            <mailto:michafla at gene.com>>>>
                         wrote:

                                  Gabe is still testing/profiling, but
            we'll send
                         something randomized
                                  along eventually.

                                  On Tue, Aug 26, 2014 at 11:15 AM,
            Martin Morgan
                         <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
            <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>
                         <mailto:mtmorgan at fhcrc.org
            <mailto:mtmorgan at fhcrc.org>>>> wrote:

                                      I didn't see in the original thread
a
                         reproducible (simulated, I
                                      guess) example, to be explicit
            about what the
                         problem is??

                                      Martin

                                      On 08/26/2014 10:47 AM, Michael
            Lawrence wrote:

                                          My understanding is that the
heap
                         optimization provided marginal
                                          gains, and
                                          that we need to think harder
            about how to
                         optimize the all of
                                          the string
                                          manipulation in writeVcf. We
            either need to
                         reduce it or reduce its
                                          overhead (i.e., the CHARSXP
            allocation).
                         Gabe is doing more tests.

                                          On Tue, Aug 26, 2014 at 9:43
            AM, Valerie
                         Obenchain
                                          <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>

                                          wrote:

                                              Hi Gabe,

                                              Martin responded, and so
            did Michael,

            https://stat.ethz.ch/______pipermail/bioc-devel/2014-____
__August/006082.html
            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____
August/006082.html>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____
August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html>>

            <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____
August/006082.html
            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html>

            <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__
August/006082.html
            <https://stat.ethz.ch/pipermail/bioc-devel/2014-
August/006082.html>>>

                                              It sounded like Michael
            was ok with
                         working with/around heap
                                              initialization.

                                              Michael, is that right or
            should we
                         still consider this on
                                              the table?

                                              Val

                                              On 08/26/2014 09:34 AM,
            Gabe Becker
            wrote:

                                                  Val,

                                                  Has there been any
            movement on
                         this? This remains a
                                                  substantial
                                                  bottleneck for us when
            writing very
                         large VCF files (e.g.
                                                  variants+genotypes for
            whole genome
                         NGS samples).

                                                  I was able to see a
            ~25% speedup
                         with 4 cores and  an
                                                  "optimal" speedup
                                                  of ~2x with 10-12
            cores for a VCF
                         with 500k rows  using
                                                  a very naive
                                                  parallelization
            strategy and no
                         other changes. I suspect
                                                  this could be
                                                  improved on quite a
            bit, or
                         possibly made irrelevant
                                                  with judicious use
                                                  of serial C code.

                                                  Did you and Martin
            make any plans
                         regarding optimizing
                                                  writeVcf?

                                                  Best
                                                  ~G

                                                  On Tue, Aug 5, 2014 at
            2:33 PM,
                         Valerie Obenchain
                                                  <vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>

              <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>> <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>
                         <mailto:vobencha at fhcrc.org
            <mailto:vobencha at fhcrc.org>>>>>

                                                  wrote:

                                                        Hi Michael,

                                                        I'm interested
            in working on
                         this. I'll discuss
                                                  with Martin next
                                                        week when we're
            both back in
                         the office.

                                                        Val

                                                        On 08/05/14
            07:46, Michael
                         Lawrence wrote:

                                                            Hi guys
            (Val, Martin,
            Herve):

                                                            Anyone have
            an itch for
                         optimization? The
                                                  writeVcf function is
                                                            currently a
                                                            bottleneck
            in our WGS
                         genotyping pipeline. For
                                                  a typical 50
                                                            million row
                                                            gVCF, it was
            taking 2.25
                         hours prior to
                                                  yesterday's improvements

              (pasteCollapseRows) that
                         brought it down to
                                                  about 1 hour, which
                                                            is still
                                                            too long by
            my standards
                         (> 0). Only takes 3
                                                  minutes to call the
                                                            genotypes
                                                            (and
associated
                         likelihoods etc) from the
                                                  variant calls (using
                                                            80 cores and
                                                            450 GB RAM
            on one node),
                         so the output is an
                                                  issue. Profiling
                                                            suggests that
                                                            the running
            time scales
                         non-linearly in the
                                                  number of rows.

                                                            Digging a
            little deeper,
                         it seems to be
                                                  something with R's
                                                            string/memory
                                                            allocation.
            Below,
                         pasting 1 million strings
                                                  takes 6 seconds, but
                                                  10
                                                            million
            strings takes
                         over 2 minutes. It gets
                                                  way worse with 50
                                                            million. I
                                                            suspect it
            has something
                         to do with R's string
                                                  hash table.

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                                6.396
              0.028   6.420

                                                            end <-
            sample(1e8, 1e7)

              system.time(paste0("END",
                         "=", end))
                                                                 user
            system elapsed
                                                            134.714
              0.352 134.978

                                                            Indeed, even
            this takes a
                         long time (in a
                                                  fresh session):

                                                            set.seed(1000)
                                                            end <-
            sample(1e8, 1e6)
                                                            end <-
            sample(1e8, 1e7)

                         system.time(as.character(end))
                                                                 user
            system elapsed
                                                               57.224
              0.156  57.366

                                                            But running
            it a second
                         time is faster (about
                                                  what one would
                                                            expect?):

              system.time(levels <-
                         as.character(end))
                                                                 user
            system elapsed
                                                               23.582
              0.021  23.589

                                                            I did some
            simple
                         profiling of R to find that
                                                  the resizing of
                                                            the string
                                                            hash table
            is not a
                         significant component of
                                                  the time. So maybe
                                                            something
                                                            to do with
            the R heap/gc?
                         No time right now to
                                                  go deeper. But I
                                                            know Martin
                                                            likes this
            sort of
            thing ;)

                                                            Michael

            [[alternative
                         HTML version deleted]]

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>>
                                                            mailing list

            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

            _______________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>

              <mailto:Bioc-devel at r-project <mailto:Bioc-devel at r-project>.
                         <mailto:Bioc-devel at r-project
            <mailto:Bioc-devel at r-project>.>______org

              <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>> mailing
                                                  list
            https://stat.ethz.ch/mailman/________listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

            <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                                  --
                                                  Computational Biologist
                                                  Genentech Research

                                                   [[alternative HTML
            version
            deleted]]

            _____________________________________________________
            Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
            <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>
                         <mailto:Bioc-devel at r-project.
            <mailto:Bioc-devel at r-project.>____org
                         <mailto:Bioc-devel at r-project.__org
            <mailto:Bioc-devel at r-project.org>>>
                                          mailing list
            https://stat.ethz.ch/mailman/______listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

            <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

            <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>

                                      --
                                      Computational Biology / Fred
            Hutchinson Cancer
                         Research Center
                                      1100 Fairview Ave. N.
                                      PO Box 19024 Seattle, WA 98109

                                      Location: Arnold Building M1 B861
                                      Phone: (206) 667-2793
            <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

                              --
                              Computational Biologist
                              Genentech Research

                         --
                         Computational Biologist
                         Genentech Research

                     --
                     Computational Biology / Fred Hutchinson Cancer
            Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109

                     Location: Arnold Building M1 B861
                     Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
            <tel:%28206%29%20667-2793>

            --
            Computational Biologist
            Genentech Research

        _________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>

--
Computational Biologist
Genentech Research

Computational Biologist
Genentech Research

	[[alternative HTML version deleted]]
Hi Gabe,

It would help to have a common baseline. Please show the output for 
writing the Illumina file you sent originally:

library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8-10 minutes 
depending on machine load. (Machine has 378 GIG of RAM.)
dim(vcf)
[1] 51612762        1
gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157824501 8428.8  298615851 15947.9  261241334 13951.8
Vcells 1109937571 8468.2 1778479081 13568.8 1693642432 12921.5
print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
601.265   9.261 614.230
gc()
             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157829989 8429.1  298615851 15947.9  261241334 13951.8
Vcells 1109941106 8468.2 2270282109 17320.9 2243573149 17117.2
I'm not seeing substantial differences in writeVcf with expanded vs 
collapsed VCF. However, as an aside, note that calling writeVcf on 
VRanges (or equivalently ExpandedVCF) the new file will have multiple 
rows per position if there were multiple ALT alleles. I'm not sure this 
is technically valid as per the VCF specs but am assuming Michael knew 
about this.

Five positions in original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
 > dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
 > length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are 
written out.
 > dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?

Valerie
sessionInfo()
R version 3.1.0 Patched (2014-06-24 r66016)
Platform: x86_64-unknown-linux-gnu (64-bit)

...
other attached packages:
[1] VariantAnnotation_1.11.35 Rsamtools_1.17.34
[3] Biostrings_2.33.14        XVector_0.5.8
[5] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[7] IRanges_1.99.28           S4Vectors_0.2.4
[9] BiocGenerics_0.11.5

Valerie,

Apologies for this taking much longer than it should have. The changes
in Bioc-devel have wreaked havoc on the code we use to to generate and
process the data we need to write out, but the fault is mine for not
getting on top of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5hrs to write the an expanded vcf with  56M rows
(print output and sessionInfo() follow). I'll try reading in the
illumina platinum and writing it back out to see if it is something
about our specific vcf object (could ExpandedVCF vs VCF be an issue?).

 > vcfgeno
*class: ExpandedVCF *
*dim: 50307989 1 *
rowData(vcf):
   GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
   DataFrame with 1 column: END
   Fields with no header: END
geno(vcf):
   SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
           Number Type    Description
    AD     2      Integer Allelic depths (number of reads in each
observed al...
    DP     1      Integer Total read depth
    FT     1      String  Variant filters
    GT     1      String  Genotype
    GQ     1      Integer Genotype quality
    PL     3      Integer Normalized, Phred-scaled likelihoods for
genotypes
    MIN_DP 1      Integer Minimum DP observed within the GVCF block
 > sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
  [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
  [3] AnnotationDbi_1.27.16     Biobase_2.25.0
  [5] gmapR_1.7.8               VTGenotyping_0.0.1
  [7] BiocParallel_0.99.22      futile.logger_1.3.7
  [9] VariantTools_1.7.5 *VariantAnnotation_1.11.35*
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
  [1] annotate_1.43.5                base64enc_0.1-2
  [3] BatchJobs_1.4                  BBmisc_1.7
  [5] biomaRt_2.21.1                 bitops_1.0-6
  [7] brew_1.0-6                     BSgenome_1.33.9
  [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <vobencha at fhcrc.org
<mailto:vobencha at fhcrc.org>> wrote:

    Hi Gabe,

    Have you had a chance to test writeVcf? The changes made over the
    past week have shaved off more time. It now takes ~ 9 minutes to
    write the NA12877 example.

            dim(vcf)

        [1] 51612762        1

            gc()

                      used   (Mb) gc trigger    (Mb)   max used    (Mb)
        Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
        Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8

            print(system.time(writeVcf(__vcf, tempfile())))

            user  system elapsed
        555.282   6.700 565.700

            gc()

                      used   (Mb) gc trigger    (Mb)   max used    (Mb)
        Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
        Vcells 1176960717 8979.5 2183277445 <tel:2183277445> 16657.1
        2171401955 16566.5

    In the most recent version (1.11.35) I've added chunking for files
    with > 1e5 records. Right now the choice of # records per chunk is
    simple, based on total records only. We are still experimenting with
    this. You can override default chunking with 'nchunk'. Examples on
    the man page.

    Valerie

    On 09/08/14 08:43, Gabe Becker wrote:

        Val,

        That is great. I'll check this out and test it on our end.

        ~G

        On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain
        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
        <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>> wrote:

             The new writeVcf code is in 1.11.28.

             Using the illumina file you suggested, geno fields only,
        writing now
             takes about 17 minutes.

              > hdr
             class: VCFHeader
             samples(1): NA12877
             meta(6): fileformat ApplyRecalibration ... reference source
             fixed(1): FILTER
             info(22): AC AF ... culprit set
             geno(8): GT GQX ... PL VF

              > param = ScanVcfParam(info=NA)
              > vcf = readVcf(fl, "", param=param)
              > dim(vcf)
             [1] 51612762        1

              > system.time(writeVcf(vcf, "out.vcf"))
                  user   system  elapsed
               971.032    6.568 1004.593

             In 1.11.28, parsing of geno data was moved to C. If this didn't
             speed things up enough we were planning to implement 'chunking'
             through the VCF and/or move the parsing of info to C,
        however, it
             looks like geno was the bottleneck.

             I've tested a number of samples/fields combinations in
        files with >=
             .5 million rows and the improvement over writeVcf() in
        release is ~ 90%.

             Valerie

             On 09/04/14 15:28, Valerie Obenchain wrote:

                 Thanks Gabe. I should have something for you on Monday.

                 Val

                 On 09/04/2014 01:56 PM, Gabe Becker wrote:

                     Val and Martin,

                     Apologies for the delay.

                     We realized that the Illumina platinum genome vcf
        files make
                     a good test
                     case, assuming you strip out all the info (info=NA when
                     reading it into
                     R) stuff.

        ftp://platgene:G3n3s4me at ussd-____ftp.illumina.com/NA12877_S1.____genome.vcf.gz
        <ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz>

        <ftp://platgene:G3n3s4me at ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz
        <ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz>>
                     took about ~4.2 hrs to write out, and is about 1.5x
        the size
                     of the
                     files we are actually dealing with (~50M ranges vs
        our ~30M).

                     Looking forward a new vastly improved writeVcf :).

                     ~G

                     On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
                     <lawrence.michael at gene.com
        <mailto:lawrence.michael at gene.com>
                     <mailto:lawrence.michael at gene.__com
        <mailto:lawrence.michael at gene.com>>
                     <mailto:lawrence.michael at gene.
        <mailto:lawrence.michael at gene.>____com
                     <mailto:lawrence.michael at gene.__com
        <mailto:lawrence.michael at gene.com>>>> wrote:

                          Yes, it's very clear that the scaling is
        non-linear,
                     and Gabe has
                          been experimenting with a chunk-wise +
        parallel algorithm.
                          Unfortunately there is some frustrating
        overhead with the
                          parallelism. But I'm glad Val is arriving at
        something
                     quicker.

                          Michael

                          On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
                     <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>
        <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                          <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>>> wrote:

                              On 08/27/2014 11:56 AM, Gabe Becker wrote:

                                  The profiling I attached in my
        previous email
                     is for 24 geno
                                  fields, as I said,
                                  but our typical usecase involves only ~4-6
                     fields, and is
                                  faster but still on
                                  the order of dozens of minutes.

                              I think Val is arriving at a (much) more
        efficient
                              implementation, but...

                              I wanted to share my guess that the poor
        _scaling_
                     is because
                              the garbage collector runs multiple times
        as the
                     different
                              strings are pasted together, and has to
        traverse,
                     in linear
                              time, increasing numbers of allocated
        SEXPs. So
                     times scale
                              approximately quadratically with the
        number of rows
                     in the VCF

                              An efficiency is to reduce the number of
        SEXPs in
                     play by
                              writing out in chunks -- as each chunk is
        written,
                     the SEXPs
                              become available for collection and are
        re-used.
                     Here's my toy
                              example

                              time.R
                              ======
                              splitIndices <- function (nx, ncl)
                              {
                                   i <- seq_len(nx)
                                   if (ncl == 0L)
                                       list()
                                   else if (ncl == 1L || nx == 1L)
                                       list(i)
                                   else {
                                       fuzz <- min((nx - 1L)/1000, 0.4 *
        nx/ncl)
                                       breaks <- seq(1 - fuzz, nx +
        fuzz, length
                     = ncl + 1L)
                                       structure(split(i, cut(i, breaks,
                     labels=FALSE)), names
                              = NULL)
                                   }
                              }

                              x = as.character(seq_len(1e7)); y = sample(x)
                              if (!is.na <http://is.na> <http://is.na>
                     <http://is.na>(Sys.getenv("____SPLIT", NA))) {

                                   idx <- splitIndices(length(x), 20)
                                   system.time(for (i in idx)
        paste(x[i], y[i],
                     sep=":"))
                              } else {
                                   system.time(paste(x, y, sep=":"))
                              }

                              running under R-devel with $ SPLIT=TRUE R
        --no-save
                     --quiet -f
                              time.R the relevant time is

                                  user  system elapsed
                                15.320   0.064  15.381

                              versus with $ R --no-save --quiet -f
        time.R it is

                                  user  system elapsed
                                95.360   0.164  95.511

                              I think this is likely an overall strategy
        when
                     dealing with
                              character data -- processing in
        independent chunks
                     of moderate
                              (1M?) size (enabling as a consequence parallel
                     evaluation in
                              modest memory) that are sufficient to
        benefit from
                              vectorization, but that do not entail
        allocation of
                     large
                              numbers of in-use SEXPs.

                              Martin

                                  Sorry for the confusion.
                                  ~G

                                  On Wed, Aug 27, 2014 at 11:45 AM, Gabe
        Becker
                                  <beckerg4 at gene.com
        <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>>
                     <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com> <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>>>
                                  <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>
                     <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>> <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>
                     <mailto:beckerg4 at gene.com
        <mailto:beckerg4 at gene.com>>>>> wrote:

                                       Martin and Val.

                                       I re-ran writeVcf on our (G)VCF data
                     (34790518 ranges,
                                  24 geno fields) with
                                       profiling enabled. The results of
                     summaryRprof for that
                                  run are attached,
                                       though for a variety of reasons
        they are
                     pretty
                     misleading.

                                       It took over an hour to write
                     (3700+seconds), so it's
                                  definitely a
                                       bottleneck when the data get very
        large,
                     even if it
                                  isn't for smaller data.

                                       Michael and I both think the
        culprit is
                     all the pasting
                                  and cbinding that is
                                       going on, and more to the point, that
                     memory for an
                                  internal representation
                                       to be written out is allocated at
        all.
                     Streaming
                                  across the object, looping
                                       by rows and writing directly to
        file (e.g.
                     from C)
                                  should be blisteringly
                                       fast in comparison.

                                       ~G

                                       On Tue, Aug 26, 2014 at 11:57 AM,
        Michael
                     Lawrence
                                  <michafla at gene.com
        <mailto:michafla at gene.com> <mailto:michafla at gene.com
        <mailto:michafla at gene.com>>
                     <mailto:michafla at gene.com
        <mailto:michafla at gene.com> <mailto:michafla at gene.com
        <mailto:michafla at gene.com>>>
                                       <mailto:michafla at gene.com
        <mailto:michafla at gene.com>
                     <mailto:michafla at gene.com
        <mailto:michafla at gene.com>> <mailto:michafla at gene.com
        <mailto:michafla at gene.com>
                     <mailto:michafla at gene.com
        <mailto:michafla at gene.com>>>>>
                                  wrote:

                                           Gabe is still
        testing/profiling, but
                     we'll send
                                  something randomized
                                           along eventually.

                                           On Tue, Aug 26, 2014 at 11:15 AM,
                     Martin Morgan
                                  <mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>
                     <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org> <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>>
                                           <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
                                  <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>
                     <mailto:mtmorgan at fhcrc.org
        <mailto:mtmorgan at fhcrc.org>>>>> wrote:

                                               I didn't see in the
        original thread a
                                  reproducible (simulated, I
                                               guess) example, to be
        explicit
                     about what the
                                  problem is??

                                               Martin

                                               On 08/26/2014 10:47 AM,
        Michael
                     Lawrence wrote:

                                                   My understanding is
        that the heap
                                  optimization provided marginal
                                                   gains, and
                                                   that we need to think
        harder
                     about how to
                                  optimize the all of
                                                   the string
                                                   manipulation in
        writeVcf. We
                     either need to
                                  reduce it or reduce its
                                                   overhead (i.e., the
        CHARSXP
                     allocation).
                                  Gabe is doing more tests.

                                                   On Tue, Aug 26, 2014
        at 9:43
                     AM, Valerie
                                  Obenchain
                                                   <vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>>>>

                                                   wrote:

                                                       Hi Gabe,

                                                       Martin responded,
        and so
                     did Michael,

        https://stat.ethz.ch/________pipermail/bioc-devel/2014-________August/006082.html
        <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html>

        <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>>

        <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>>

        <https://stat.ethz.ch/______pipermail/bioc-devel/2014-______August/006082.html
        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html>

        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>>

        <https://stat.ethz.ch/____pipermail/bioc-devel/2014-____August/006082.html
        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html>

        <https://stat.ethz.ch/__pipermail/bioc-devel/2014-__August/006082.html
        <https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html>>>>

                                                       It sounded like
        Michael
                     was ok with
                                  working with/around heap
                                                       initialization.

                                                       Michael, is that
        right or
                     should we
                                  still consider this on
                                                       the table?

                                                       Val

                                                       On 08/26/2014
        09:34 AM,
                     Gabe Becker
                     wrote:

                                                           Val,

                                                           Has there
        been any
                     movement on
                                  this? This remains a
                                                           substantial
                                                           bottleneck
        for us when
                     writing very
                                  large VCF files (e.g.

        variants+genotypes for
                     whole genome
                                  NGS samples).

                                                           I was able to
        see a
                     ~25% speedup
                                  with 4 cores and  an
                                                           "optimal" speedup
                                                           of ~2x with 10-12
                     cores for a VCF
                                  with 500k rows  using
                                                           a very naive
                                                           parallelization
                     strategy and no
                                  other changes. I suspect
                                                           this could be
                                                           improved on
        quite a
                     bit, or
                                  possibly made irrelevant
                                                           with
        judicious use
                                                           of serial C code.

                                                           Did you and
        Martin
                     make any plans
                                  regarding optimizing
                                                           writeVcf?

                                                           Best
                                                           ~G

                                                           On Tue, Aug
        5, 2014 at
                     2:33 PM,
                                  Valerie Obenchain

        <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>>>

                       <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org> <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>> <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>>
                                  <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>
                     <mailto:vobencha at fhcrc.org
        <mailto:vobencha at fhcrc.org>>>>>__>

                                                           wrote:

                                                                 Hi Michael,

                                                                 I'm
        interested
                     in working on
                                  this. I'll discuss
                                                           with Martin next
                                                                 week
        when we're
                     both back in
                                  the office.

                                                                 Val

                                                                 On 08/05/14
                     07:46, Michael
                                  Lawrence wrote:

                                                                     Hi guys
                     (Val, Martin,
                     Herve):

        Anyone have
                     an itch for
                                  optimization? The
                                                           writeVcf
        function is

        currently a

        bottleneck
                     in our WGS
                                  genotyping pipeline. For
                                                           a typical 50

        million row

        gVCF, it was
                     taking 2.25
                                  hours prior to
                                                           yesterday's
        improvements

                       (pasteCollapseRows) that
                                  brought it down to
                                                           about 1 hour,
        which
                                                                     is
        still
                                                                     too
        long by
                     my standards
                                  (> 0). Only takes 3
                                                           minutes to
        call the

        genotypes

        (and associated
                                  likelihoods etc) from the
                                                           variant calls
        (using
                                                                     80
        cores and
                                                                     450
        GB RAM
                     on one node),
                                  so the output is an
                                                           issue. Profiling

        suggests that
                                                                     the
        running
                     time scales
                                  non-linearly in the
                                                           number of rows.

        Digging a
                     little deeper,
                                  it seems to be
                                                           something
        with R's

        string/memory

        allocation.
                     Below,
                                  pasting 1 million strings
                                                           takes 6
        seconds, but
                                                           10
                                                                     million
                     strings takes
                                  over 2 minutes. It gets
                                                           way worse with 50

        million. I

        suspect it
                     has something
                                  to do with R's string
                                                           hash table.

        set.seed(1000)
                                                                     end <-
                     sample(1e8, 1e6)

                       system.time(paste0("END",
                                  "=", end))

          user
                     system elapsed

        6.396
                       0.028   6.420

                                                                     end <-
                     sample(1e8, 1e7)

                       system.time(paste0("END",
                                  "=", end))

          user
                     system elapsed
                                                                     134.714
                       0.352 134.978

        Indeed, even
                     this takes a
                                  long time (in a
                                                           fresh session):

        set.seed(1000)
                                                                     end <-
                     sample(1e8, 1e6)
                                                                     end <-
                     sample(1e8, 1e7)

                                  system.time(as.character(end))

          user
                     system elapsed

          57.224
                       0.156  57.366

                                                                     But
        running
                     it a second
                                  time is faster (about
                                                           what one would

        expect?):

                       system.time(levels <-
                                  as.character(end))

          user
                     system elapsed

          23.582
                       0.021  23.589

                                                                     I
        did some
                     simple
                                  profiling of R to find that
                                                           the resizing of
                                                                     the
        string

        hash table
                     is not a
                                  significant component of
                                                           the time. So
        maybe

        something
                                                                     to
        do with
                     the R heap/gc?
                                  No time right now to
                                                           go deeper. But I

        know Martin

        likes this
                     sort of
                     thing ;)

                                                                     Michael

                     [[alternative
                                  HTML version deleted]]

        _________________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                     <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>______org
                                  <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>>

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project> <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>>.
                                  <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>>.__>______org

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>______org
                                  <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>>>

        mailing list

        https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>

        _________________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                     <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>______org
                                  <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>>

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project> <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>>.
                                  <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>>.__>______org

                       <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>______org
                                  <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>>> mailing
                                                           list
        https://stat.ethz.ch/mailman/__________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>>

        <https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>>

                                                           --
                                                           Computational
        Biologist
                                                           Genentech
        Research

          [[alternative HTML
                     version
                     deleted]]

                     _______________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                     <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>
                                  <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.
                     <mailto:Bioc-devel at r-project
        <mailto:Bioc-devel at r-project>.>______org
                                  <mailto:Bioc-devel at r-project.
        <mailto:Bioc-devel at r-project.>____org
                     <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>>>
                                                   mailing list
        https://stat.ethz.ch/mailman/________listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>>

        <https://stat.ethz.ch/mailman/______listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>

        <https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                     <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>>>

                                               --
                                               Computational Biology / Fred
                     Hutchinson Cancer
                                  Research Center
                                               1100 Fairview Ave. N.
                                               PO Box 19024 Seattle, WA
        98109

                                               Location: Arnold Building
        M1 B861
                                               Phone: (206) 667-2793
        <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>
                                  <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>

                                       --
                                       Computational Biologist
                                       Genentech Research

                                  --
                                  Computational Biologist
                                  Genentech Research

                              --
                              Computational Biology / Fred Hutchinson Cancer
                     Research Center
                              1100 Fairview Ave. N.
                              PO Box 19024 Seattle, WA 98109

                              Location: Arnold Building M1 B861
                              Phone: (206) 667-2793
        <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
                     <tel:%28206%29%20667-2793>

                     --
                     Computational Biologist
                     Genentech Research

                 ___________________________________________________
        Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
        <mailto:Bioc-devel at r-project.__org
        <mailto:Bioc-devel at r-project.org>>
                 mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
                 <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>

        --
        Computational Biologist
        Genentech Research

--
Computational Biologist
Genentech Research