Skip to content

[Bioc-devel] restrictToSNV for VCF

16 messages · Valerie Obenchain, Julian Gehring, Stephanie M. Gogarten +5 more

#
Hi,

I've added a restrictToSNV() function to VariantAnnotation (1.9.46). The 
return value is a subset VCF object containing SNVs only. The function 
operates on CollapsedVCF or ExapandedVCF and the alt(VCF) value must be 
nucleotides (i.e., no structural variants).

A variant is considered a SNV if the nucleotide sequences in both 
ref(vcf) and alt(x) are of length 1. I have a question about how 
variants with multiple 'ALT' values should be handled.

Should we consider row 4 a SNV? One 'ALT' is length 1, the other is not.

ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
REF <- DNAStringSet(c("G", c("AA"), "T", "G"))
Thanks.
Valerie
#
Hi Valerie,

I would consider G>C an SNV, G>TT not.  But I assume that there exists
no clear consensus on this.  How about a flag that let's the second pass
as SNV optionally, so everybody can get what one needs?

Best wishes
Julian
On 18/03/14 18:36, Valerie Obenchain wrote:
#
I would say rows 1 and 3 are SNVs, but not row 4.  For this application 
I think a variant has to be an SNV or not, as you can't pass half a 
variant.  (I suppose you could remove the ALT values with length > 1 and 
set those genotypes to missing, but that is both complicated and 
unexpected behavior.  Also, it could introduce bias into association 
testing, since you would get non-random missingness.)

isSNV() in SeqVarTools returns TRUE only if the max length of all 
alleles is 1.  It also has a logical argument "biallelic" which allows 
to select for only biallelic SNVs - that could be useful here as well. 
If biallelic=TRUE, only row 1 would make it into the subset.

Stephanie
On 3/18/14 4:04 PM, Julian Gehring wrote:
#
Thanks for the feedback.

I'll look into nchar for XStringSetList.

I'm in favor of supporting isDeletion(), isInsertion(), isIndel() and 
isSNV() for the VCF classes and removing restrictToSNV(). I could add an 
argument 'all_alt' or 'all_alt_agreement' to be used with CollapsedVCF 
in the case where not all alternate alleles meet the criteria.

Here are the current definitions:
Valerie
On 03/19/2014 01:07 PM, Vincent Carey wrote:

  
    
1 day later
#
Hi,
On 03/19/2014 01:10 PM, Michael Lawrence wrote:
This is inherited from SummarizedExperiment:

   > example(SummarizedExperiment)

   > se1
   class: SummarizedExperiment
   dim: 200 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

   > se1[1:4]
   class: SummarizedExperiment
   dim: 4 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

To me that means that a SummarizedExperiment has a length
(conceptually), and that this length is the number of rows.
It would actually help if a "length" method was defined:

   > length(se1)
   [1] 1

That would automatically fix many convenience [ wrappers like head(),
tail(), rev(), etc...

   > head(se1)
   class: SummarizedExperiment
   dim: 1 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

   > rev(se1)
   class: SummarizedExperiment
   dim: 1 6
   exptData(0):
   assays(1): counts
   rownames: NULL
   rowData metadata column names(0):
   colnames(6): A B ... E F
   colData names(1): Treatment

Following that logic names(se1) also probably return colnames(se1).

H.

  
    
#
On 03/20/2014 05:20 PM, Herv? Pag?s wrote:
[...]
/\
                                should

H.

  
    
#
On 03/20/2014 05:20 PM, Herv? Pag?s wrote:
I think of a SummarizedExperiment as fundamentally a matrix with row and column 
annotations. 'length' would then be prod(dim(se1)) col- and rownames() are 
defined but names() is NULL. I guess 1-D sub-setting isn't matrix-like, but I 
don't think that removing this 'feature' simply for consistency sake is worth 
it; I guess the subsetting logical was copy/pasted from other code without 
enough thought. head(), tail() could be implemented if this were somehow useful 
(I usually use these for compact display, and that's irrelevant here...); rev() 
on a matrix doesn't do anything useful.

Martin

  
    
#
Hi Martin,
On 03/21/2014 01:45 PM, Martin Morgan wrote:
But it's not defined as such either.

Note that findOverlaps() on SummarizedExperiment objects returns a
Hits object with indices in the 1:nrow(query) or 1:nrow(subject)
range. I'd like to be able to say "in the seq_along(query) or
seq_along(subject) range" because that's what findOverlaps() does
on any other object defined in IRanges/GenomicRanges/GenomicAlignments.
But I can't because that would be inaccurate.

However, it's conceptually true: I can use the indices in the Hits
object to do 1D extractions from the query or subject. This is good
and consistent with any other type of query or subject.
I was not suggesting that.
I still find it sometimes useful to be able to do head() on a big
object when I just want to try things on a few elements first:

   > dim(vcf)
   [1] 1000000       3

   toy <- head(vcf)
   rowData(toy)
   assay(toy)
   isSNV(toy)
   findOverlaps(toy, exons)

It's more convenient and much quicker than having to truncate the
individual results:

   head(rowData(vcf))
   head(assay(vcf))
   head(isSNV(vcf))
   head(findOverlaps(vcf, exons))

I guess what I'm trying to say is that while it helps thinking of
a SummarizedExperiment as fundamentally a matrix, there are already
enough differences with the matrix API to suggest that, unlike for a
matrix, the length of a SummarizedExperiment object is its nb of rows.
It's implicit in many ways and I think that formalizing it would help
rather than hurt. It will still be somewhat a surprise for the
end-user, but not a bigger surprise than the ones s/he gets right
now with seq_along(vcf), vcf[i], isSNV(vcf), findOverlaps(), head(),
etc.. And once s/he gets over it, there won't be anymore surprises:
all these things will be in agreement with length(vcf) and behave
as expected.

Thanks,
H.