Skip to content
Back to formatted view

Raw Message

Message-ID: <42a2da8e-9409-6951-7844-577c5b3b7ed0@fredhutch.org>
Date: 2020-08-04T16:29:34Z
From: Hervé Pagès
Subject: [Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel
In-Reply-To: <855133d2-e33d-354d-d194-723014d8ad06@upf.edu>

Hi Robert,

The VCF file uses "22" for the chromosome name which is the name used by 
NCBI. So explicitly specifying "hg19" in the readVcf() call is like 
saying that this chromosome name is a UCSC name which is why 
seqlevelsStyle() gets confused later.

If you specify the name of the NCBI assembly, things work as expected:

   fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
   vcf <- readVcf(fl, "GRCh37")
   seqlevels(vcf)
   # [1] "22"
   seqlevelsStyle(vcf)
   # [1] "NCBI"
   seqlevelsStyle(vcf) <- "UCSC"
   seqlevels(vcf)
   # [1] "chr22"

Or, if you don't know what reference genome the file is based on, don't 
specify it:

   fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
   vcf <- readVcf(fl)
   seqlevels(vcf)
   # [1] "22"
   seqlevelsStyle(vcf)
   # [1] "NCBI"    "Ensembl"
   seqlevelsStyle(vcf) <- "UCSC"
   seqlevels(vcf)
   # [1] "chr22"

or specify it later:

   genome(vcf) <- "hg19"
   seqinfo(vcf)
   # Seqinfo object with 1 sequence from hg19 genome; no seqlengths:
   #   seqnames seqlengths isCircular genome
   #   chr22            NA         NA   hg19

Hope this helps,
H.


On 7/29/20 08:30, Robert Castelo wrote:
> hi,
> 
> it looks like either VariantAnnotation::readVcf() or something in the 
> CollapsedVCF class broke in devel with respect to reading and setting 
> sequence styles:
> 
> library(VariantAnnotation)
> 
> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> seqlevels(vcf)
> [1] "22"
> seqlevelsStyle(vcf)
> [1] "UCSC"
> seqlevelsStyle(vcf) <- "UCSC"
> seqlevels(vcf)
> [1] "22"
> 
> you can find my session information below. let me know if you want me to 
> open an issue at the GitHub repo (VariantAnnotatoin or GenomeInfoDb?).
> 
> thanks!
> 
> robert.
> 
> BiocManager::version()
> [1] ?3.12?
> sessionInfo()
> R version 4.0.0 (2020-04-24)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.4 LTS
> 
> Matrix products: default
> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> 
> locale:
>  ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C
>  ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=en_US.UTF-8
>  ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=C
>  ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C
>  ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats4??? parallel? stats???? graphics? grDevices utils???? datasets
> [8] methods?? base
> 
> other attached packages:
>  ?[1] VariantAnnotation_1.35.3 Rsamtools_2.5.3
>  ?[3] Biostrings_2.57.2 XVector_0.29.3
>  ?[5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7
>  ?[7] matrixStats_0.56.0 Matrix_1.2-18
>  ?[9] Biobase_2.49.0 GenomicRanges_1.41.5
> [11] GenomeInfoDb_1.25.8 IRanges_2.23.10
> [13] S4Vectors_0.27.12 BiocGenerics_0.35.4
> [15] BiocManager_1.30.10
> 
> loaded via a namespace (and not attached):
>  ?[1] progress_1.2.2?????????? tidyselect_1.1.0 purrr_0.3.4
>  ?[4] lattice_0.20-41????????? vctrs_0.3.1 generics_0.0.2
>  ?[7] BiocFileCache_1.13.0???? rtracklayer_1.49.4 GenomicFeatures_1.41.2
> [10] blob_1.2.1?????????????? XML_3.99-0.4 rlang_0.4.6
> [13] pillar_1.4.4???????????? glue_1.4.1 DBI_1.1.0
> [16] rappdirs_0.3.1?????????? BiocParallel_1.23.2 bit64_0.9-7.1
> [19] dbplyr_1.4.4???????????? GenomeInfoDbData_1.2.3 lifecycle_0.2.0
> [22] stringr_1.4.0??????????? zlibbioc_1.35.0 memoise_1.1.0
> [25] biomaRt_2.45.2?????????? curl_4.3 AnnotationDbi_1.51.3
> [28] Rcpp_1.0.4.6???????????? BSgenome_1.57.5 openssl_1.4.1
> [31] bit_1.1-15.2???????????? hms_0.5.3 askpass_1.1
> [34] digest_0.6.25??????????? stringi_1.4.6 dplyr_1.0.0
> [37] grid_4.0.0?????????????? tools_4.0.0 bitops_1.0-6
> [40] magrittr_1.5???????????? RCurl_1.98-1.2 RSQLite_2.2.0
> [43] tibble_3.0.1???????????? crayon_1.3.4 pkgconfig_2.0.3
> [46] ellipsis_0.3.1?????????? prettyunits_1.1.1 assertthat_0.2.1
> [49] httr_1.4.1?????????????? R6_2.4.1 GenomicAlignments_1.25.3
> [52] compiler_4.0.0
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=gp0KKC6W1uS1YnyFI5iSuxF5WSUpOhbHwL94GRP8yu0&s=Co1P5SErF64uPYhHMltM3De48hQLl-XHK3gfZOEnSKc&e= 
> 

-- 
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319