[Bioc-devel] writeVcf performance

The new writeVcf code is in 1.11.28.

Using the Illumina file you suggested, with geno fields only, writing now 
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
     user   system  elapsed
  971.032    6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough, we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C. However, it looks like geno 
was the bottleneck.
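
For reference, here is a minimal sketch of what 'chunking' through a VCF could look like from user code. It is not the writeVcf() internals, only an illustration under some assumptions: a VariantAnnotation version that provides VcfFile() (older versions used TabixFile() with yieldSize in the same way), the chr22.vcf.gz example file shipped with the package as a stand-in for a large VCF, and a hypothetical output directory with one file per chunk.

```r
library(VariantAnnotation)

## Example file shipped with VariantAnnotation (stand-in for a large VCF).
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")

## Skip the info fields, as in the timing above.
param <- ScanVcfParam(info=NA)

## With yieldSize set, each readVcf() call returns at most that many records.
tab <- VcfFile(fl, yieldSize=4000)
open(tab)

## Hypothetical output location: one VCF file per chunk.
outdir <- tempfile("chunks")
dir.create(outdir)

i <- 0L       # number of chunks written
total <- 0L   # number of records written
while (nrow(chunk <- readVcf(tab, "hg19", param=param))) {
    i <- i + 1L
    total <- total + nrow(chunk)
    writeVcf(chunk, file.path(outdir, sprintf("chunk_%03d.vcf", i)))
}
close(tab)
```

Only one chunk is held in memory at a time, at the cost of repeating the header in every output file.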

I've tested a number of sample/field combinations in files with >= 0.5 
million rows, and the improvement over writeVcf() in release is ~90%.

Valerie

On 09/04/14 15:28, Valerie Obenchain wrote: