[Bioc-devel] VCF Intersection Using readVcf Remarkably Slow

4 messages · Dario Strbenac, Michael Lawrence, Vincent Carey +1 more

#
Good day,

file <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
anotherFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package = "VariantAnnotation")
aSet <- readVcf(file, "hg19")
system.time(commonMutations <- readVcf(anotherFile, "hg19", rowRanges(aSet)))
   user  system elapsed 
209.120  16.628 226.083 

Reading in the Exome chromosome 22 VCF and intersecting it with the other file in the data directory takes almost 4 minutes.

However, reading in the whole file is much faster.
user  system elapsed 
  0.376   0.016   0.392 

and doing the intersection manually takes a fraction of a second
user  system elapsed 
  0.128   0.000   0.129
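The thread doesn't show the code behind these two timings; a plausible reconstruction (with `subsetByOverlaps` as a stand-in for whatever "manual" intersection was actually benchmarked) is:

```r
library(VariantAnnotation)

## One pass over the whole file -- no per-range tabix queries
system.time(whole <- readVcf(anotherFile, "hg19"))

## Position-only intersection, done in memory
system.time(commonMutations <- subsetByOverlaps(whole, rowRanges(aSet)))
```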

This comparison ignores finer details such as the identities of the alleles, but does it have to be so slow? My real use case is intersecting dozens of VCF files from cancer samples with the ExAC consortium's VCF file, which is 4 GB when compressed. I can't imagine how long that would take.

Can the code of readVcf be optimised?

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
#
I think the basic problem is that each range requires a separate query
through tabix. BAM and tabix are designed to be fast for single
queries, like what a genome browser might generate, but not for
querying thousands of regions at once. At least that's the way it
seems to me. The index is only at the block level, because the data
are compressed.  In principle, smarter caching could speed this up,
but that belongs at the samtools level.

To make many queries, it pays to first load the data into memory, or into a more
efficient on-disk representation (HDF5, GDS, ...).
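One way to realize the on-disk suggestion is the GDS route; a sketch, assuming the SeqArray package (my assumption -- no specific package is named in the thread):

```r
library(SeqArray)

## One-time conversion: VCF -> GDS
seqVCF2GDS("hapmap_exome_chr22.vcf.gz", "chr22.gds")

## Afterwards, range-based selection runs against the GDS index
gds <- seqOpen("chr22.gds")
seqSetFilter(gds, rowRanges(aSet))   # select variants overlapping the query ranges
pos <- seqGetData(gds, "position")   # pull only the fields actually needed
seqClose(gds)
```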

Michael

On Tue, Sep 27, 2016 at 3:00 PM, Dario Strbenac
<dstr7320 at uni.sydney.edu.au> wrote:
#
Dario's computer is faster than mine:

system.time(commonMutations <- readVcf(anotherFile, "hg19",
    rowRanges(aSet)))

   user  system elapsed

426.271  57.296 483.766

The disk infrastructure is a determinant of throughput.  Most VCF queries
are decomposable and can be parallelized.  After chunking the indices of
aSet into 20 chunks:

system.time(res <- mclapply(chunks, function(x) readVcf(anotherFile, "hg19",
    rowRanges(aSet)[x]), mc.cores=20))

   user  system elapsed
628.307 322.830  51.303


As far as I can tell, the answers agree.  Is this a risky approach?  Also,
the payload of readVcf can be reduced with a ScanVcfParam, which might
improve throughput.
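For instance, a ScanVcfParam that skips the INFO and genotype fields entirely (the particular fields kept here are illustrative choices, not from the thread):

```r
library(VariantAnnotation)

## Keep only the fixed ALT column; info=NA and geno=NA suppress
## the INFO and genotype payload entirely.
param <- ScanVcfParam(fixed="ALT", info=NA, geno=NA)
slim <- readVcf(anotherFile, "hg19", param)
commonMutations <- subsetByOverlaps(slim, rowRanges(aSet))
```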

On Tue, Sep 27, 2016 at 6:23 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
#
On 09/27/2016 06:00 PM, Dario Strbenac wrote:
Another option is to iterate through the file using yieldSize to manage memory, e.g.,

rngs <- rowRanges(aSet)
genome(rngs) <- "b37"

GenomicFiles::reduceByYield(
     VcfFile(anotherFile, yieldSize=10000),
     readVcf,
     function(YIELD) subsetByOverlaps(YIELD, rngs),
     c)