Skip to content

[Bioc-devel] writeVcf performance

15 messages · Martin Morgan, Michael Lawrence, Gabriel Becker +3 more

#
On 08/27/2014 11:56 AM, Gabe Becker wrote:
I think Val is arriving at a (much) more efficient implementation, but...

I wanted to share my guess that the poor _scaling_ is because the garbage 
collector runs multiple times as the different strings are pasted together, and 
has to traverse, in linear time, increasing numbers of allocated SEXPs. So times 
scale approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by writing out in chunks 
-- as each chunk is written, the SEXPs become available for collection and are 
re-used. Here's my toy example

time.R
======
splitIndices <- function (nx, ncl)
{
     i <- seq_len(nx)
     if (ncl == 0L)
         list()
     else if (ncl == 1L || nx == 1L)
         list(i)
     else {
         fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
         breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
     }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
     idx <- splitIndices(length(x), 20)
     system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
     system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f time.R the 
relevant time is

    user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

    user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with character data -- 
processing in independent chunks of moderate (1M?) size (enabling as a 
consequence parallel evaluation in modest memory) that are sufficient to benefit 
from vectorization, but that do not entail allocation of large numbers of in-use 
SEXPs.

Martin

  
    
#
Yes, it's very clear that the scaling is non-linear, and Gabe has been
experimenting with a chunk-wise + parallel algorithm. Unfortunately there
is some frustrating overhead with the parallelism. But I'm glad Val is
arriving at something quicker.

Michael
On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

            

  
  
2 days later
#
Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into R)
stuff.

ftp://platgene:G3n3s4me at ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz took
about ~4.2 hrs to write out, and is about 1.5x the size of the files we are
actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:

  
    
#
Thanks Gabe. I should have something for you on Monday.

Val
On 09/04/2014 01:56 PM, Gabe Becker wrote:
#
This approach, writing in chunks, is the same Herve and I used for writing
FASTA in the Biostrings package, although I see that Herve has now replaced
the R implementation with a C implementation.  I similarly found an
absolutely huge speed up when writing genomes, by chunking.

Best,
Kasper
On Tue, Sep 2, 2014 at 4:33 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

            

  
  
3 days later
#
The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now 
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
     user   system  elapsed
  971.032    6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C, however, it looks like geno 
was the bottleneck.

I've tested a number of samples/fields combinations in files with >= .5 
million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie
On 09/04/14 15:28, Valerie Obenchain wrote:
#
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
    
#
fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...

Val
On 09/08/2014 08:43 AM, Gabe Becker wrote:
1 day later
#
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the 
process of moving SimpleList and DataFrame from IRanges to S4Vectors; 
finished up today I think. Anyhow, if you get VariantAnnotation from svn 
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe 
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still 
gain from adding that.

Valerie
On 09/08/2014 12:00 PM, Valerie Obenchain wrote:
#
Hi Val,
On 09/09/2014 02:12 PM, Valerie Obenchain wrote:
I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of an unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : 
FALSE

Hard for me to tell whether this is related to DataFrame moving from 
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.

  
    
#
Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was 
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.

Val
On 09/09/2014 02:47 PM, Herv? Pag?s wrote:

  
    
#
Ah, ok. I should have 'svn up' and re-tried 'R CMD check' before
reporting this. Thanks and sorry for the noise.

H.
On 09/09/2014 03:09 PM, Valerie Obenchain wrote:

  
    
7 days later
#
Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the past 
week have shaved off more time. It now takes ~ 9 minutes to write the 
NA12877 example.
In the most recent version (1.11.35) I've added chunking for files with 
 > 1e5 records. Right now the choice of # records per chunk is simple, 
based on total records only. We are still experimenting with this. You 
can override default chunking with 'nchunk'. Examples on the man page.

Valerie
On 09/08/14 08:43, Gabe Becker wrote:
12 days later
#
Valerie,

Apologies for this taking much longer than it should have. The changes in
Bioc-devel have wreaked havoc on the code we use to to generate and process
the data we need to write out, but the fault is mine for not getting on top
of it sooner.

I'm not seeing the speed you mentioned above in the latest devel version
(1.11.35). It took ~1.5hrs to write the an expanded vcf with  56M rows
(print output and sessionInfo() follow). I'll try reading in the illumina
platinum and writing it back out to see if it is something about our
specific vcf object (could ExpandedVCF vs VCF be an issue?).
*class: ExpandedVCF *
*dim: 50307989 1 *
rowData(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 1 column: END
  Fields with no header: END
geno(vcf):
  SimpleList of length 7: AD, DP, FT, GT, GQ, PL, MIN_DP
geno(header(vcf)):
          Number Type    Description

   AD     2      Integer Allelic depths (number of reads in each observed
al...
   DP     1      Integer Total read depth

   FT     1      String  Variant filters

   GT     1      String  Genotype

   GQ     1      Integer Genotype quality

   PL     3      Integer Normalized, Phred-scaled likelihoods for genotypes

   MIN_DP 1      Integer Minimum DP observed within the GVCF block
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantCallingPaper_0.0.3 GenomicFeatures_1.17.17
 [3] AnnotationDbi_1.27.16     Biobase_2.25.0
 [5] gmapR_1.7.8               VTGenotyping_0.0.1
 [7] BiocParallel_0.99.22      futile.logger_1.3.7
 [9] VariantTools_1.7.5        *VariantAnnotation_1.11.35*
[11] Rsamtools_1.17.34         Biostrings_2.33.14
[13] XVector_0.5.8             rtracklayer_1.25.16
[15] GenomicRanges_1.17.42     GenomeInfoDb_1.1.23
[17] IRanges_1.99.28           S4Vectors_0.2.4
[19] BiocGenerics_0.11.5       switchr_0.2.1

loaded via a namespace (and not attached):
 [1] annotate_1.43.5                base64enc_0.1-2
 [3] BatchJobs_1.4                  BBmisc_1.7
 [5] biomaRt_2.21.1                 bitops_1.0-6
 [7] brew_1.0-6                     BSgenome_1.33.9
 [9] CGPtools_2.2.0                 checkmate_1.4
[11] codetools_0.2-9                DBI_0.3.1
[13] DESeq_1.17.0                   digest_0.6.4
[15] fail_1.2                       foreach_1.4.2
[17] futile.options_1.0.0           genefilter_1.47.6
[19] geneplotter_1.43.0             GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1                     iterators_1.0.7
[25] lambda.r_1.1.6                 lattice_0.20-29
[27] Matrix_1.1-4                   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3                 rjson_0.2.14
[31] RSQLite_0.11.4                 sendmailR_1.2-1
[33] splines_3.1.1                  stringr_0.6.2
[35] survival_2.37-7                tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4                   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
    
#
Hi Gabe,

It would help to have a common baseline. Please show the output for 
writing the Illumina file you sent originally:

library(VariantAnnotation)
fl <- "NA12877_S1.genome.vcf.gz"
vcf <- readVcf(fl, "", param=ScanVcfParam(info=NA))
dim(vcf)

gc()
print(system.time(writeVcf(vcf, tempfile())))
gc()

Here's my output. For this file, writing takes between 8-10 minutes 
depending on machine load. (Machine has 378 GIG of RAM.)
I'm not seeing substantial differences in writeVcf with expanded vs 
collapsed VCF. However, as an aside, note that calling writeVcf on 
VRanges (or equivalently ExpandedVCF) the new file will have multiple 
rows per position if there were multiple ALT alleles. I'm not sure this 
is technically valid as per the VCF specs but am assuming Michael knew 
about this.

Five positions in original VCF:
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "")
 > dim(vcf)
[1] 5 3

vr <- as(vcf, "VRanges")
 > length(vr)
[1] 21

Before writing out, VRanges is coerced to ExpandedVCF; 7 rows are 
written out.
 > dim(as(vr, "VCF"))
[1] 7 3

Can you provide a portion of 'vcfgeno' for testing?

Valerie
...

        
On 09/30/2014 08:36 AM, Gabe Becker wrote: