Skip to content

[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

12 messages · Robert Castelo, Michael Lawrence, Valerie Obenchain

#
Michael,

regarding our email exchange three weeks ago, I found a couple of places 
in VariantAnnotation that IMO need to be updated to avoid enforcing 
strand on VRanges.

these places occur when constructing and validating VRanges objects, 
concretely at:

1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges <-
   function(seqnames = Rle(), ranges = IRanges(),
            ref = character(), alt = NA_character_,
            totalDepth = NA_integer_, refDepth = NA_integer_,
            altDepth = NA_integer_, ..., sampleNames = NA_character_,
            softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), 
ncol = 0L),
              FilterRules()),
            hardFilters = FilterRules())
{
   gr <- GRanges(seqnames, ranges,
                 strand = .rleRecycleVector("*", length(ranges)), ...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-")
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*", 
length(ranges)),  :
   formal argument "strand" matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():

.valid.VRanges.strand <- function(object) {
   if (any(object at strand == "-"))
     paste("'strand' must always be '+' or '*'")
}

[...]

.valid.VRanges <- function(object)
{
   c(.valid.VRanges.ref(object),
     .valid.VRanges.alt(object),
     .valid.VRanges.strand(object),
     .valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand are 
detected:

library(VariantAnnotation)
example(VRanges)
strand(vr) <- "-"
c(vr)
Error in validObject(.Object) :
   invalid class ?VRanges? object: 'strand' must always be '+' or '*'


cheers,

robert.
On 05/22/2015 09:49 PM, Michael Lawrence wrote:

  
    
#
VRanges is supposed to enforce strand. The goal is to use "*" always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?
On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
#
my understanding was that VRanges is a container for variants and 
variant annotations and strand is just one annotation more. when we use 
locateVariants() a variant can be annotated to multiple transcripts 
where in one overlaps an exon, in another an intron and so on. In all 
those transcripts annotations there is a strand annotation, the strand 
of the transcript. if the transcript is annotated in the negative strand 
of the reference chromosome then the annotation of a transcript region 
to a variant is going to be also on the negative strand.

both locateVariants() and predictCoding() return GRanges objects with 
strand annotations according to the transcripts being annotated. I 
thought it made sense in VariantFiltering to use VRanges objects as a  
container for variants and annotations and, for this reason, I would 
like to carry on the strand annotation into the VRanges object. Is there 
a strong reason for a VRanges object, derived from GRanges, not to have 
strand?
On 6/10/15 6:26 PM, Michael Lawrence wrote:
#
I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.
On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo <robert.castelo at upf.edu> wrote:
#
one option for me is just to add a metadata column with the strand of 
the overlapping feature. however, i'm interested to fully understand the 
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele 
and sample. variants in VCF files are obtained from tallying reads 
aligned on a reference genome. so, my understanding is that the 
reference allele is the allele of the reference genome against which the 
reads were aligned while the alternate allele(s) are allele calls 
different from the reference. from this perspective, my interpretation 
is that ref and alt alleles have already a strand, which is the strand 
of the reference chromosome against which the reads were aligned to. i'm 
interested in this interpretation of the strand of the variants because 
i'm interested in the interpretation of sequence-features containing the 
reference and the alternate alleles, such as differences in a binding 
site with the reference and the alternate allele.

if we relax the meaning of elements in a VRanges object to, not only 
variants x allele x sample, but to variants x allele x sample x 
annotated-feature, then i think it would make sense to have the 
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now 
mostly interested in knowing whether i'm misinterpreting the design of 
VRanges objects, and maybe variant calling in general or i'm in a more 
or less right path in using a VRanges object to hold variant annotations.


thanks!!!

robert.
On 06/11/2015 04:30 AM, Michael Lawrence wrote:

  
    
#
The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].
On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
#
Of course, the inclusion of strand would imply an interpretation of the 
variant and its strand (e.g., "-") with respect to an annotated feature. 
I can see a practical problem of integrity of the information on a 
VRanges object, by which a mandatory column, such as strand, depends on 
a non-mandatory column, such as some feature annotation stored as a 
metadata column.

A solution would be to add the transcript identifier (TXID) as mandatory 
column on the VRanges object but I suspect this is a big change to do, 
so adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by 
locateVariants) in the metadata columns of the VRanges object would 
allow me to use a VRanges object as a container of variant x allele x 
sample x annotation.

Just to clear up the issue of merging strand and variant: a noisy 
variant (a variant that is not silent) and has a, e.g., loss-of-function 
effect such as the gain of a stop codon, is usually interpreted in the 
strand of the transcript and coding sequence in which the stop codon is 
gained, saying something like and A changed to a T producting the stop 
codon TAA. Ref and alt alleles are called in the strand of the reference 
chromosome, so if the transcript was annotated in the negative strand, 
we would know that we need to reverse-complement ref and alt to 
interpret the variant, although I see no need to do anything on the 
VRanges object to ref and alt because we know they are always in the 
strand of the reference chromosome. Only if you want to detect this 
stop-gain event (with predictCoding) then you would have to 
reverse-complement the ref and alt alleles. Conversely, if the variant 
falls in an intergenic region, then obviously the strand plays no role 
in the interpretation of the variant and nothing needs to be done when 
interpreting the ref and alt alleles.
On 6/11/15 5:47 PM, Michael Lawrence wrote:
#
I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael
On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
#
In fact, a feature like locateVariants() returning an object of the same 
class of the input query argument would be (IMO) a useful feature at 
least in the case of an input CollapsedVCF object, because the user 
could then dump directly the annotations to a VCF file with writeVcf(), 
which is one way to store variant annotations into files.
On 6/11/15 6:38 PM, Michael Lawrence wrote:
#
locateVariants(), predictCoding() and the family of mapToTranscripts() 
functions all return strand according to the annotation matched. The 
only time the strand of the output could possibly be different from the 
strand of the input 'query' is when 'ignore.strand = TRUE' (FALSE by 
default).

I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? 
By just using the default, the output will have the same strand as the 
input 'query' (unless 'query' is '*' of course).

That said, do you still feel it's necessary to add a LOCSTRAND column to 
the output?

Val
On 06/11/2015 09:38 AM, Michael Lawrence wrote:

  
    
#
Val,

I wasn't suggesting that LOCSTRAND be added to the locateVariants()
output. Rather, it would be added to the VRanges during the merge.

Michael

On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain
<vobencha at fredhutch.org> wrote:
#
I see. So the merge functions would just add the output of 
locateVariants() to the GRanges or VCF used as 'query'. Or I guess, as 
Robert suggested, locateVariants() could return the same object as 'query'.

Val
On 06/11/2015 10:19 AM, Michael Lawrence wrote: