
[Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel

4 messages · Hervé Pagès, Robert Castelo, Manders-2, F.M.

#
Hi Robert,

The VCF file uses "22" for the chromosome name which is the name used by 
NCBI. So explicitly specifying "hg19" in the readVcf() call is like 
saying that this chromosome name is a UCSC name which is why 
seqlevelsStyle() gets confused later.

If you specify the name of the NCBI assembly, things work as expected:

   fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
   vcf <- readVcf(fl, "GRCh37")
   seqlevels(vcf)
   # [1] "22"
   seqlevelsStyle(vcf)
   # [1] "NCBI"
   seqlevelsStyle(vcf) <- "UCSC"
   seqlevels(vcf)
   # [1] "chr22"

Or, if you don't know what reference genome the file is based on, don't 
specify it:

   fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
   vcf <- readVcf(fl)
   seqlevels(vcf)
   # [1] "22"
   seqlevelsStyle(vcf)
   # [1] "NCBI"    "Ensembl"
   seqlevelsStyle(vcf) <- "UCSC"
   seqlevels(vcf)
   # [1] "chr22"

or specify it later:

   genome(vcf) <- "hg19"
   seqinfo(vcf)
   # Seqinfo object with 1 sequence from hg19 genome; no seqlengths:
   #   seqnames seqlengths isCircular genome
   #   chr22            NA         NA   hg19

Hope this helps,
H.
On 7/29/20 08:30, Robert Castelo wrote:

  
    
1 day later
#
hi Hervé,

thank you very much for your clarifications, but this behavior is
different in release, and has been until now. this is BioC 3.11:

library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf)
[1] "22"
seqlevelsStyle(vcf)
[1] "NCBI"    "Ensembl"

i appreciate that the behavior now in devel is more consistent. i
actually never understood the need to specify the 'genome="hg19"'
argument, since this can in principle be figured out from the VCF header
information. However, the documentation has now become confusing:
if you go to subsections 2.1 and 2.1.2 of the introductory vignette, it
shows readVcf() being called with "hg19", but then the sequence names are
literally what they are in the VCF file (NCBI style).

because of the large user base of VariantAnnotation (top-49 download)
and the many possible reverse dependencies downstream, i'd suggest that,
when the 'genome' argument does not match the sequence style of the VCF
file, readVcf() either issues an error or, maybe even better, overrides
the sequence level style in the VCF file, maybe with a warning.

cheers,

robert.
On 04/08/2020 18:29, Hervé Pagès wrote:
#
Hi Robert,

Yes, seqlevelsStyle()'s new behavior is slightly different and less
forgiving. The thing is that it will generally reveal dormant issues,
which is not such a bad thing after all.

Note that it doesn't seem completely straightforward to retrieve the 
reference genome/assembly directly from the VCF header. AFAICT this 
information is either missing or weirdly formatted. For example the 
headers of the 1000genomes VCF files located at 
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ 
contain

   ##reference=1000Genomes-NCBI37

or in the ex2.vcf file included in VariantAnnotation it's:

   ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

so it's not clear that importing this into the genome field of the
returned VCF object would be that helpful.
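
To make the point about the header concrete, here is a minimal base-R sketch that pulls the value of the `##reference` line out of the two headers quoted above. The helper names `extract_reference()` and `guess_assembly()` are hypothetical (they are not part of VariantAnnotation), and the regular expression is only an assumption about what an assembly token might look like.

```r
# The two header lines quoted in this message
header_lines <- c(
  "##reference=1000Genomes-NCBI37",
  "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"
)

# Hypothetical helper: return the raw value of every ##reference line
extract_reference <- function(lines) {
  ref_lines <- grep("^##reference=", lines, value = TRUE)
  sub("^##reference=", "", ref_lines)
}

# Hypothetical helper: look for a token such as NCBI36, GRCh37 or hg19
# inside the raw value; NA when nothing recognizable is found
guess_assembly <- function(reference) {
  vapply(reference, function(ref) {
    hit <- regmatches(ref, regexpr("(NCBI|GRCh|hg)[0-9]+", ref))
    if (length(hit) == 0L) NA_character_ else hit
  }, character(1), USE.NAMES = FALSE)
}

raw_values <- extract_reference(header_lines)
raw_values                  # free-form strings, not usable as a genome name as-is
guess_assembly(raw_values)  # a best-effort guess only
```

The raw values are a label in one case and a file URL in the other, so anything beyond a heuristic guess like this would need a convention that the VCF files in question do not follow.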

Thanks for pointing me to the VariantAnnotation vignette. I'll fix the 
calls to readVcf() to use GRCh37 instead of hg19. Seems like one call 
(on ex2.vcf) is using the wrong genome: ex2.vcf is based on hg18/NCBI36, 
not on hg19/GRCh37. Will fix that too.

Sure, readVcf() could probably be improved to perform some sanity checks
by making sure that the user-supplied genome is compatible with the
chromosome names. However, that still won't prevent the user from
specifying the wrong genome (e.g. GRCh37 instead of NCBI36), like in the
ex2.vcf case. Anyway, this is a feature request for readVcf().
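
As a rough illustration of what such a sanity check could look like, here is a base-R sketch. It is not how readVcf() works: `genome_style`, `guess_style()` and `check_genome_style()` are hypothetical names, the lookup table covers only the assemblies mentioned in this thread plus their hg38/GRCh38 counterparts, and the "chr" prefix test is a deliberately crude stand-in for real style detection.

```r
# Hypothetical lookup: naming style conventionally associated with a genome name
genome_style <- c(hg18 = "UCSC", hg19 = "UCSC", hg38 = "UCSC",
                  NCBI36 = "NCBI", GRCh37 = "NCBI", GRCh38 = "NCBI")

# Crude style guess: all names start with "chr" -> UCSC, none do -> NCBI
guess_style <- function(seqnames) {
  has_prefix <- grepl("^chr", seqnames)
  if (all(has_prefix)) {
    "UCSC"
  } else if (!any(has_prefix)) {
    "NCBI"
  } else {
    NA_character_
  }
}

# Returns TRUE when no conflict is detected, FALSE (with a warning) otherwise
check_genome_style <- function(seqnames, genome) {
  expected <- unname(genome_style[genome])  # NA for genomes not in the table
  observed <- guess_style(seqnames)
  ok <- is.na(expected) || is.na(observed) || identical(expected, observed)
  if (!ok) {
    warning(sprintf("genome '%s' implies %s-style names, but the file looks %s-style",
                    genome, expected, observed), call. = FALSE)
  }
  ok
}

check_genome_style("22", "GRCh37")  # no conflict detected
check_genome_style("22", "hg19")    # warns about the style mismatch
check_genome_style("22", "NCBI36")  # no conflict either, whatever the file is really based on
```

The last call shows the limitation described above: a check on chromosome names can catch an hg19/GRCh37 style mix-up, but it cannot tell GRCh37 from NCBI36.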

In the end, I'm not sure what the purpose of specifying the genome is
anyway. What does it give us? Maybe the vignette and examples in
VariantAnnotation should stop doing that? Better to not specify the
genome than to specify the wrong one.

Best,
H.
On 8/6/20 07:42, Robert Castelo wrote:

  
    
#
Hi Herve,

The behavior of seqlevelsStyle when the genome of a GRanges object has been set still seems strange to me.
Here is some example code I ran in the Bioconductor:devel docker:
library(GenomicRanges)
gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))
genome(gr) <- "hg19"
seqlevelsStyle(gr) <- "NCBI"

This results in R freezing for more than a minute, after which the following cryptic error is given: 
Error in function (type, msg, asError = TRUE)  :
  Failed to connect to ftp.ncbi.nlm.nih.gov port 21: Connection refused

Maybe seqlevelsStyle should check if the genome of a GRanges object has been set and throw an informative error?
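
For reference, a possible workaround is sketched below. It assumes that mapSeqlevels() and renameSeqlevels() from GenomeInfoDb rely only on the naming tables shipped with the package and therefore do not contact NCBI (which is my understanding, but I have not verified it in devel). The genome is deliberately left unset, in line with the earlier advice not to specify it.

```r
library(GenomicRanges)  # attaches GenomeInfoDb as a dependency

gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))

# genome(gr) is left unset here, so no assembly lookup should be needed.
# mapSeqlevels() proposes NCBI-style names for the current seqlevels ...
ncbi_names <- mapSeqlevels(seqlevels(gr), "NCBI")

# ... and renameSeqlevels() applies that mapping to the object
gr_ncbi <- renameSeqlevels(gr, ncbi_names)
seqlevels(gr_ncbi)
```

This only renames the sequences; it does not validate them against any particular assembly.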

With kind regards,
Freek Manders

On 06/08/2020, 19:18, "Bioc-devel on behalf of Hervé Pagès" <bioc-devel-bounces at r-project.org on behalf of hpages at fredhutch.org> wrote:

>> On 7/29/20 08:30, Robert Castelo wrote:
>>> hi,
    >>>
    >>> it looks like either VariantAnnotation::readVcf() or something in the 
    >>> CollapsedVCF class broke in devel with respect to reading and setting 
    >>> sequence styles:
    >>>
    >>> library(VariantAnnotation)
    >>>
    >>> fl <- system.file("extdata", "chr22.vcf.gz", 
    >>> package="VariantAnnotation")
    >>> vcf <- readVcf(fl, "hg19")
    >>> seqlevels(vcf)
    >>> [1] "22"
    >>> seqlevelsStyle(vcf)
    >>> [1] "UCSC"
    >>> seqlevelsStyle(vcf) <- "UCSC"
    >>> seqlevels(vcf)
    >>> [1] "22"
    >>>
    >>> you can find my session information below. let me know if you want me 
    >>> to open an issue at the GitHub repo (VariantAnnotation or 
    >>> GenomeInfoDb?).
    >>>
    >>> thanks!
    >>>
    >>> robert.
    >>>
    >>> BiocManager::version()
    >>> [1] ‘3.12’
    >>> sessionInfo()
    >>> R version 4.0.0 (2020-04-24)
    >>> Platform: x86_64-pc-linux-gnu (64-bit)
    >>> Running under: Ubuntu 18.04.4 LTS
    >>>
    >>> Matrix products: default
    >>> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
    >>>
    >>> locale:
    >>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    >>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    >>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
    >>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    >>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
    >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    >>>
    >>> attached base packages:
    >>> [1] stats4    parallel  stats     graphics  grDevices utils datasets
    >>> [8] methods   base
    >>>
    >>> other attached packages:
    >>>   [1] VariantAnnotation_1.35.3 Rsamtools_2.5.3
    >>>   [3] Biostrings_2.57.2 XVector_0.29.3
    >>>   [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7
    >>>   [7] matrixStats_0.56.0 Matrix_1.2-18
    >>>   [9] Biobase_2.49.0 GenomicRanges_1.41.5
    >>> [11] GenomeInfoDb_1.25.8 IRanges_2.23.10
    >>> [13] S4Vectors_0.27.12 BiocGenerics_0.35.4
    >>> [15] BiocManager_1.30.10
    >>>
    >>> loaded via a namespace (and not attached):
    >>>   [1] progress_1.2.2           tidyselect_1.1.0 purrr_0.3.4
    >>>   [4] lattice_0.20-41          vctrs_0.3.1 generics_0.0.2
    >>>   [7] BiocFileCache_1.13.0     rtracklayer_1.49.4 GenomicFeatures_1.41.2
    >>> [10] blob_1.2.1               XML_3.99-0.4 rlang_0.4.6
    >>> [13] pillar_1.4.4             glue_1.4.1 DBI_1.1.0
    >>> [16] rappdirs_0.3.1           BiocParallel_1.23.2 bit64_0.9-7.1
    >>> [19] dbplyr_1.4.4             GenomeInfoDbData_1.2.3 lifecycle_0.2.0
    >>> [22] stringr_1.4.0            zlibbioc_1.35.0 memoise_1.1.0
    >>> [25] biomaRt_2.45.2           curl_4.3 AnnotationDbi_1.51.3
    >>> [28] Rcpp_1.0.4.6             BSgenome_1.57.5 openssl_1.4.1
    >>> [31] bit_1.1-15.2             hms_0.5.3 askpass_1.1
    >>> [34] digest_0.6.25            stringi_1.4.6 dplyr_1.0.0
    >>> [37] grid_4.0.0               tools_4.0.0 bitops_1.0-6
    >>> [40] magrittr_1.5             RCurl_1.98-1.2 RSQLite_2.2.0
    >>> [43] tibble_3.0.1             crayon_1.3.4 pkgconfig_2.0.3
    >>> [46] ellipsis_0.3.1           prettyunits_1.1.1 assertthat_0.2.1
    >>> [49] httr_1.4.1               R6_2.4.1 GenomicAlignments_1.25.3
    >>> [52] compiler_4.0.0
    >>>
    -- 
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fredhutch.org
    Phone:  (206) 667-5791
    Fax:    (206) 667-1319

    _______________________________________________
    Bioc-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel