Skip to content

[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

8 messages · Valerie Obenchain, Robert Castelo, Michael Lawrence +1 more

#
Hi Robert,

Thanks for the bug report and reproducible example. Now fixed in release 
1.12.2 and devel 1.13.4.

I've also updated the docs to better explain how the Seqinfo objects are 
propagated / merged when supplied as 'genome'.

Valerie
On 10/23/2014 06:45 AM, Robert Castelo wrote:
#
hi Valerie,

thanks for the quick fix and updating the documentation, i have a 
further question about the seqinfo slot and particularly the use of 
seqlevelsStyle(). Let me illustrate it with an example again:


==============
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## read again the same VCF file
fl <- file.path(system.file("extdata", package="VariantFiltering"), 
"CEUtrio.vcf.bgz")
vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## select the standard chromosomes
vcf <- keepStandardChromosomes(vcf)

## since the input VCF file had NCBI style, let's match
## the style of the TxDb annotations
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## drop the mitochondrial chromosome (b/c of the different lengths
## between b37 and hg19
vcf <- dropSeqlevels(vcf, "chrM")

## try to annotate the location of the variants. it prompts an
## error because the 'genome' slot of the Seqinfo object still
## has b37 after running seqlevelsStyle
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
Error in mergeNamedAtomicVectors(genome(x), genome(y), what = 
c("sequence",  :
   sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, 
chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, 
chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes:
   - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, 
b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37
   - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, 
hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, 
hg19, hg19, hg19

## this can be fixed by setting the 'genome' slot to the values of
## the TxDb object
genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)), 
names(genome(txdb)))]

## now this works
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
=================

so my question is, should not seqlevelsStyle() also change the 'genome' 
slot of the Seqinfo object in the updated object?

if not, would the solution be updating the 'genome' slot in the way i 
did it?

thanks!
robert.
On 10/23/2014 11:14 PM, Valerie Obenchain wrote:

  
    
#
This is a good question. I'm not sure we want seqlevelsStyle() to also 
alter the genome value. I think it's a reasonable request but I'd like 
to open it up to discussion. I've cc'd a few others for input.

Valerie
On 10/24/14 09:05, Robert Castelo wrote:
#
I don't think a seqname style implies a specific genome build. But the
inverse might make sense. Given a genome build identifier, we could check
for consistent naming. Perhaps an option on "genome<-" could support this?



On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
  
#
hi Michael,

if we assume that a seqname style does not imply a specific genome 
build, then i'd say that the error below about having incompatible 
genomes should not pop up because sequence styles have been already 
matched, right?
On 10/24/14 10:22 PM, Michael Lawrence wrote:

  
  
#
Not sure I understand. Setting the seqlevelsStyle cannot change the genome
build, so the two Seqinfos will remain incompatible in that way. I think
what you want is to be able to say "let's consider these two genome builds
to be the same", which seems reasonable after dropping chrM. I was
proposing that this could be done with something like:

genome(vcf, rectifySeqlevels=TRUE) <- genome(txdb)

For convenience, we might want that argument to be TRUE by default, but
that would change current behavior.


On Fri, Oct 24, 2014 at 3:41 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:

  
  
#
you're right, i was just thinking about b37 (w/o MT) and hg19 but one 
can have the same seqlevelsStyle with two different builds such as hg18 
and hg19.  Well then, the solution i was using below,

genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
names(genome(txdb)))]

was not that bad, but it's a bit cumbersome to write, something like 
what you propose below looks better to me. I also wonder why each 
chromosome should have a genome build (which forces me to intersect in 
the line before), could not we assume that all chromosomes come from the 
same build?
On 10/25/14 12:56 AM, Michael Lawrence wrote:

  
  
#
Hi Robert, Michael,
On 10/24/2014 04:21 PM, Robert Castelo wrote:
genome(vcf) <- genome(txdb)[[1]] should do the same and is simpler.
But we should be able to just do

   genome(vcf) <- genome(txdb)

instead. Unfortunately this didn't work (until a change I just
committed) because the genome() setter was being too paranoid
when checking the supplied value. I just relaxed its logic a little
so now the genome(x) <- genome(y) idiom should almost always work
(except for some rare edge cases).

This is in GenomeInfoDb 1.2.2 (release) and 1.3.6 (devel).
When the genome slot was added to Seqinfo objects, the decision was
made to make this slot parallel to the other slots (seqnames, etc...)
to support use cases where sequences come from different organisms.

Cheers,
H.