robert.
On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
There would be a LOT of value in having core packages use exactly the
same representation; I don't have any opinion about which one is
best.
Kasper
On Mon, Mar 16, 2015 at 3:38 PM, Herv? Pag?s <hpages at fredhutch.org
<mailto:hpages at fredhutch.org> <hpages at fredhutch.org>> wrote:
On 03/16/2015 09:22 AM, Michael Lawrence wrote:
Yes, I think it would make sense for the Xtra package to follow the
established convention in VariantAnnotation/VCF.
I don't know. I agree it would be nice to use a more consistent
representation of insertions across the software but I'm not
convinced we should necessarily follow the VCF way, which is to
include the base before the event in the ref and alt alleles as well
as in the reported range.
Note that there doesn't seem to be any consensus in the broader
Bioinformatics community either. For example dbSNP and HGVS report
the range that corresponds to the 2 flanking nucleotides but they
don't include these nucleotides in the ref or alt alleles. VCF does
the same as GFF3 which says "start equals end and the implied site is
to the right of the indicated base" except that VCF wants to treat
events that occur at position 1 in a special way. In that case VCF
says the base *after* the event should be included (seems like the
VCF authors want to avoid both: empty ranges and also ranges that
start at POS 0). BTW it doesn't seem that
VariantAnnotation::isInsertion() is aware of that special treatment.
UCSC uses a zero-width range, and so does the XtraSNPlocs.*
packages. The advantage of this representation is its simplicity.
There is no special cases. It also reflects the notion that an
insertion is a replacement of an empty string with a non-empty
string. No need to include flanking nucleotides in the representation
(which is artificial and distorts the "real" alt allele).
H.
On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
<robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>
<robert.castelo at upf.edu>> wrote:
dear devel people, specially Val and Michael,
Herv? has recently added an annotation package that includes
non-SNVs variants from dbSNP, it is called:
library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
if you execute the corresponding example you'll see the kind of
information stored in the package:
example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
if you pay attention to the following case:
my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | RefSNP_id alleles <Rle>
<IRanges> <Rle> | <character> <character> [1] 22 [10513380,
10513380] - | rs386831164 -/T [2] 22 [10519678,
10519677] + | rs71286731 -/TTT ref_allele <DNAStringSet>
[1] T [2] - ------- seqinfo: 25 sequences
(1 circular) from GRCh38 genome
you'll see the first variant (rs386831164) is a deletion of one
nucleotide and the second (rs71286731) is an insertion of three
nucleotides (TTT).
it struck me that the ranges representing the insertion had an start
position one nucleotide larger than then and i contacted Herv?
thinking that this was a mistake. however, i've learned from him that
these are so-called "zero-width" ranges and they actually allow to
distinguish insertions from every other type of variant without the
need to know anything about the reference or the alternate allele.
currently, the VCF specification 4.2 (http://samtools.github.io/
hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
REF column to help distinguishing insertions by including the
flanking nucleotide of the inserted sequence. As a result,
VariantAnnotation::readVcf() produces ranges that mimic this
standard having identical start and end positions leading to 1-width
ranges:
fl <- system.file("extdata", "CEUtrio.vcf.bgz",
package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
and 5 metadata columns: seqnames ranges strand |
paramRangeID <Rle> <IRanges> <Rle> | <factor>
rs11474033 20 [45093330, 45093330] * | <NA>
20:47592746_G/GGC 20 [47592746, 47592746] * |
<NA> REF ALT QUAL FILTER <DNAStringSet>
<DNAStringSetList> <numeric> <character> rs11474033 C
CTTCT 2901.12 . 20:47592746_G/GGC G
GGC 608.88 . ------- seqinfo: 84 sequences from hg19
genome
table(width(rowRanges(vcf[isInsertion(vcf), ])))
1 78
i would like to ask whether it would make sense to harmonize the way
in which dbSNP insertions and variants are imported into Bioconductor
by making VariantAnnotation::readVcf() to produce zero-width ranges
for insertion variants. this not a feature request, i only would like
to know what whether there is a particular reason not to use the
available zero-width ranges that seem to be implemented for this
purpose.
cheers,
robert.