dear devel people, specially Val and Michael,
Herv? has recently added an annotation package that includes non-SNVs
variants from dbSNP, it is called:
library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
if you execute the corresponding example you'll see the kind of
information stored in the package:
example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
if you pay attention to the following case:
my_snps1[1:2]
GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | RefSNP_id alleles
<Rle> <IRanges> <Rle> | <character> <character>
[1] 22 [10513380, 10513380] - | rs386831164 -/T
[2] 22 [10519678, 10519677] + | rs71286731 -/TTT
ref_allele
<DNAStringSet>
[1] T
[2] -
-------
seqinfo: 25 sequences (1 circular) from GRCh38 genome
you'll see the first variant (rs386831164) is a deletion of one nucleotide
and the second (rs71286731) is an insertion of three nucleotides (TTT).
it struck me that the ranges representing the insertion had an start
position one nucleotide larger than then and i contacted Herv? thinking
that this was a mistake. however, i've learned from him that these are
so-called "zero-width" ranges and they actually allow to distinguish
insertions from every other type of variant without the need to know
anything about the reference or the alternate allele.
currently, the VCF specification 4.2 (http://samtools.github.io/
hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF
column to help distinguishing insertions by including the flanking
nucleotide of the inserted sequence. As a result,
VariantAnnotation::readVcf() produces ranges that mimic this standard
having identical start and end positions leading to 1-width ranges:
fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")
vcf <- readVcf(fl, genome="hg19")
rowRanges(vcf[isInsertion(vcf), ])[1:2]
GRanges object with 2 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
rs11474033 20 [45093330, 45093330] * | <NA>
20:47592746_G/GGC 20 [47592746, 47592746] * | <NA>
REF ALT QUAL FILTER
<DNAStringSet> <DNAStringSetList> <numeric> <character>
rs11474033 C CTTCT 2901.12 .
20:47592746_G/GGC G GGC 608.88 .
-------
seqinfo: 84 sequences from hg19 genome
table(width(rowRanges(vcf[isInsertion(vcf), ])))
1
78
i would like to ask whether it would make sense to harmonize the way in
which dbSNP insertions and variants are imported into Bioconductor by
making VariantAnnotation::readVcf() to produce zero-width ranges for
insertion variants. this not a feature request, i only would like to know
what whether there is a particular reason not to use the available
zero-width ranges that seem to be implemented for this purpose.
cheers,
robert.