[Bioc-devel] zero-width ranges representing insertions

On 03/16/2015 09:22 AM, Michael Lawrence wrote:
I don't know. I agree it would be nice to use a more consistent
representation of insertions across the software but I'm not convinced
we should necessarily follow the VCF way, which is to include the base
before the event in the ref and alt alleles as well as in the reported
range.

Note that there doesn't seem to be any consensus in the broader
bioinformatics community either. For example, dbSNP and HGVS report the
range that corresponds to the 2 flanking nucleotides, but they don't
include these nucleotides in the ref or alt alleles. VCF does the same
as GFF3, which says "start equals end and the implied site is to the
right of the indicated base", except that VCF treats events that
occur at position 1 in a special way: in that case VCF says the base
*after* the event should be included (it seems the VCF authors want
to avoid both empty ranges and ranges that start at POS 0).
BTW it doesn't seem that VariantAnnotation::isInsertion() is aware of
that special treatment.
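
To make the difference concrete, here is a rough Python sketch of the
two conventions described above. The function names and the toy
reference "ACGT" are made up for illustration; this is not code from
VariantAnnotation or from any VCF library. "left" is the 1-based
position of the base just before the insertion, with 0 meaning the
insertion sits before the first base.

```python
def vcf_style(ref_seq, left, inserted):
    """Return (POS, REF, ALT) for an insertion, VCF-like (hypothetical helper).

    Normally the base *before* the event is used as the anchor. When the
    event is at position 1 there is no such base, so the base *after*
    the event is used instead.
    """
    if left == 0:
        anchor = ref_seq[0]              # base after the event
        return (1, anchor, inserted + anchor)
    anchor = ref_seq[left - 1]           # base before the event
    return (left, anchor, anchor + inserted)


def flanking_style(left, inserted):
    """Return (start, end, alt) with the range spanning the 2 flanking
    bases, which are NOT included in the allele (dbSNP/HGVS-like)."""
    return (left, left + 1, inserted)


# Insert "TT" between positions 2 and 3 of "ACGT".
print(vcf_style("ACGT", 2, "TT"))   # (2, 'C', 'CTT')
print(flanking_style(2, "TT"))      # (2, 3, 'TT')

# Insert "TT" before the first base: the special case.
print(vcf_style("ACGT", 0, "TT"))   # (1, 'A', 'TTA')
```

Note how the same event needs a separate code path in the VCF-like
form as soon as it sits at the start of the sequence.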

UCSC uses a zero-width range, and so do the XtraSNPlocs.* packages.
The advantage of this representation is its simplicity. There are no
special cases. It also reflects the notion that an insertion is
a replacement of an empty string with a non-empty string. There is no
need to include flanking nucleotides in the representation (which is
artificial and distorts the "real" alt allele).
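
As a small illustration of the "replacement of an empty string" view,
here is a Python sketch. It assumes 1-based inclusive coordinates in
which a zero-width range has end = start - 1 (the convention IRanges
uses for empty ranges). The function name is hypothetical.

```python
def apply_replacement(ref_seq, start, end, alt):
    """Replace ref_seq[start..end] (1-based, inclusive) with alt.

    A zero-width range (end == start - 1) selects the empty string
    between two bases, so an insertion is handled by the same code
    path as a substitution or a deletion.
    """
    if not (1 <= start <= len(ref_seq) + 1 and start - 1 <= end <= len(ref_seq)):
        raise ValueError("range out of bounds")
    return ref_seq[:start - 1] + alt + ref_seq[end:]


ref = "ACGT"
print(apply_replacement(ref, 3, 2, "TT"))  # insertion between 2 and 3: ACTTGT
print(apply_replacement(ref, 1, 0, "TT"))  # insertion before base 1:   TTACGT
print(apply_replacement(ref, 2, 2, "G"))   # substitution at 2:         AGGT
print(apply_replacement(ref, 2, 3, ""))    # deletion of 2..3:          AT
```

The insertion at the very start of the sequence needs no special
treatment here, and the alt allele is exactly the inserted string.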

H.