hi, i've found the following glitches with 'releaseName()' and 'seqlevelsStyle()', i guess due to recent changes in BSgenome and GenomeInfoDb, which is why i'm cc'ing Herv? ;) ===========BioC release 3.11 everything OK============= library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) [1] "Genome Reference Consortium GRCh37.p13" releaseName(as(Hsapiens, "GenomeDescription")) [1] "Genome Reference Consortium GRCh37.p13" seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI"??? "Ensembl" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================= ============BioC devel 3.12 inconsistencies============ library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) Error in (function (classes, fdef, mtable)? : ? unable to find an inherited method for function ?releaseName? for signature ?"BSgenome"? releaseName(as(Hsapiens, "GenomeDescription")) [1] NA seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI" "UCSC" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================== in the case of 'releaseName()' i interpret that the method has become defunct for 'BSgenome' objects, but i think it should give a sensible result with a 'GenomeDescription' object, so something may be going wrong with the coercion from 'BSgenome' to 'GenomeDescription'. in the case of 'seqlevelsStyle()', while the change of style seems to happen, i don't see how the style can be simultaneously NCBI and UCSC. below my session information for the BioC devel 3.12 version. thanks!! robert. sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=en_US.UTF-8 ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=C ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4??? parallel? stats???? graphics? grDevices utils???? datasets [8] methods?? base other attached packages: ?[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 ?[2] GenomicFeatures_1.41.2 ?[3] AnnotationDbi_1.51.3 ?[4] Biobase_2.49.0 ?[5] BSgenome.Hsapiens.UCSC.hg19_1.4.3 ?[6] BSgenome_1.57.5 ?[7] rtracklayer_1.49.5 ?[8] Biostrings_2.57.2 ?[9] XVector_0.29.3 [10] GenomicRanges_1.41.6 [11] GenomeInfoDb_1.25.10 [12] IRanges_2.23.10 [13] S4Vectors_0.27.12 [14] BiocGenerics_0.35.4 [15] BiocManager_1.30.10 loaded via a namespace (and not attached): ?[1] SummarizedExperiment_1.19.6 progress_1.2.2 ?[3] tidyselect_1.1.0 purrr_0.3.4 ?[5] lattice_0.20-41 generics_0.0.2 ?[7] vctrs_0.3.1 BiocFileCache_1.13.1 ?[9] blob_1.2.1 XML_3.99-0.4 [11] rlang_0.4.6 pillar_1.4.4 [13] glue_1.4.1 DBI_1.1.0 [15] rappdirs_0.3.1 BiocParallel_1.23.2 [17] bit64_0.9-7.1 dbplyr_1.4.4 [19] matrixStats_0.56.0 GenomeInfoDbData_1.2.3 [21] lifecycle_0.2.0 stringr_1.4.0 [23] zlibbioc_1.35.0 memoise_1.1.0 [25] biomaRt_2.45.2 curl_4.3 [27] Rcpp_1.0.4.6 openssl_1.4.1 [29] DelayedArray_0.15.7 bit_1.1-15.2 [31] Rsamtools_2.5.3 hms_0.5.3 [33] askpass_1.1 digest_0.6.25 [35] stringi_1.4.6 dplyr_1.0.0 [37] grid_4.0.0 tools_4.0.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.98-1.2 RSQLite_2.2.0 [43] tibble_3.0.1 crayon_1.3.4 [45] pkgconfig_2.0.3 ellipsis_0.3.1 [47] Matrix_1.2-18 prettyunits_1.1.1 [49] assertthat_0.2.1 httr_1.4.1 [51] R6_2.4.1 GenomicAlignments_1.25.3 [53] compiler_4.0.0
[Bioc-devel] glitches with releaseName() and seqlevelsStyle()
4 messages · Robert Castelo, Hervé Pagès
Hi Robert,
On 9/2/20 11:12, Robert Castelo wrote:
hi, i've found the following glitches with 'releaseName()' and 'seqlevelsStyle()', i guess due to recent changes in BSgenome and GenomeInfoDb, which is why i'm cc'ing Herv? ;) ===========BioC release 3.11 everything OK============= library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) [1] "Genome Reference Consortium GRCh37.p13" releaseName(as(Hsapiens, "GenomeDescription")) [1] "Genome Reference Consortium GRCh37.p13" seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI"??? "Ensembl" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================= ============BioC devel 3.12 inconsistencies============ library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) Error in (function (classes, fdef, mtable)? : ? unable to find an inherited method for function ?releaseName? for signature ?"BSgenome"? releaseName(as(Hsapiens, "GenomeDescription")) [1] NA seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI" "UCSC" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================== in the case of 'releaseName()' i interpret that the method has become defunct for 'BSgenome' objects, but i think it should give a sensible result with a 'GenomeDescription' object, so something may be going wrong with the coercion from 'BSgenome' to 'GenomeDescription'.
I'm dropping the notion of "release name" for BSgenome objects in BioC 3.12. For backward compatibility I just restored a temporary releaseName() method (in BSgenome 1.57.6) that does the following: > releaseName(BSgenome.Hsapiens.UCSC.hg19) [1] NA Warning message: Starting with Bioconductor 3.12, BSgenome objects no longer have a "release name". As a consequence of this change calling releaseName() on a BSgenome object now always returns NA and is deprecated.
in the case of 'seqlevelsStyle()', while the change of style seems to happen, i don't see how the style can be simultaneously NCBI and UCSC.
This happens when you end up with a mix of styles. In the case of hg19, all the sequences get renamed from UCSC to NCBI names **except** chrM: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene table(genome(txdb)) # hg19 # 93 seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) # [1] "NCBI" "UCSC" table(genome(txdb)) # GRCh37.p13 hg19 # 92 1 genome(txdb)[genome(txdb) == "hg19"] # chrM # "hg19" chrM cannot and should not be renamed to MT because the MT sequence in GRCh37.p13 is NOT the same as the chrM sequence in hg19. So the new seqlevelsStyle() behavior fixes a long standing bug. See previous discussion on this list about the long dance between UCSC and NCBI about the mitochondrian sequence in hg19/GRCh37/GRCh37.p13: https://stat.ethz.ch/pipermail/bioc-devel/2020-August/017086.html Hope this helps, H.
below my session information for the BioC devel 3.12 version. thanks!! robert. sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=en_US.UTF-8 ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=C ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4??? parallel? stats???? graphics? grDevices utils???? datasets [8] methods?? base other attached packages: ?[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 ?[2] GenomicFeatures_1.41.2 ?[3] AnnotationDbi_1.51.3 ?[4] Biobase_2.49.0 ?[5] BSgenome.Hsapiens.UCSC.hg19_1.4.3 ?[6] BSgenome_1.57.5 ?[7] rtracklayer_1.49.5 ?[8] Biostrings_2.57.2 ?[9] XVector_0.29.3 [10] GenomicRanges_1.41.6 [11] GenomeInfoDb_1.25.10 [12] IRanges_2.23.10 [13] S4Vectors_0.27.12 [14] BiocGenerics_0.35.4 [15] BiocManager_1.30.10 loaded via a namespace (and not attached): ?[1] SummarizedExperiment_1.19.6 progress_1.2.2 ?[3] tidyselect_1.1.0 purrr_0.3.4 ?[5] lattice_0.20-41 generics_0.0.2 ?[7] vctrs_0.3.1 BiocFileCache_1.13.1 ?[9] blob_1.2.1 XML_3.99-0.4 [11] rlang_0.4.6 pillar_1.4.4 [13] glue_1.4.1 DBI_1.1.0 [15] rappdirs_0.3.1 BiocParallel_1.23.2 [17] bit64_0.9-7.1 dbplyr_1.4.4 [19] matrixStats_0.56.0 GenomeInfoDbData_1.2.3 [21] lifecycle_0.2.0 stringr_1.4.0 [23] zlibbioc_1.35.0 memoise_1.1.0 [25] biomaRt_2.45.2 curl_4.3 [27] Rcpp_1.0.4.6 openssl_1.4.1 [29] DelayedArray_0.15.7 bit_1.1-15.2 [31] Rsamtools_2.5.3 hms_0.5.3 [33] askpass_1.1 digest_0.6.25 [35] stringi_1.4.6 dplyr_1.0.0 [37] grid_4.0.0 tools_4.0.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.98-1.2 RSQLite_2.2.0 [43] tibble_3.0.1 crayon_1.3.4 [45] pkgconfig_2.0.3 ellipsis_0.3.1 [47] Matrix_1.2-18 prettyunits_1.1.1 [49] assertthat_0.2.1 httr_1.4.1 [51] R6_2.4.1 GenomicAlignments_1.25.3 [53] compiler_4.0.0
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
Hi Herv?,
On 02/09/2020 21:44, Pages, Herve wrote:
Hi Robert, On 9/2/20 11:12, Robert Castelo wrote:
hi, i've found the following glitches with 'releaseName()' and 'seqlevelsStyle()', i guess due to recent changes in BSgenome and GenomeInfoDb, which is why i'm cc'ing Herv? ;) ===========BioC release 3.11 everything OK============= library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) [1] "Genome Reference Consortium GRCh37.p13" releaseName(as(Hsapiens, "GenomeDescription")) [1] "Genome Reference Consortium GRCh37.p13" seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI"??? "Ensembl" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================= ============BioC devel 3.12 inconsistencies============ library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) Error in (function (classes, fdef, mtable)? : ? unable to find an inherited method for function ?releaseName? for signature ?"BSgenome"? releaseName(as(Hsapiens, "GenomeDescription")) [1] NA seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI" "UCSC" head(seqlengths(txdb)) ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================== in the case of 'releaseName()' i interpret that the method has become defunct for 'BSgenome' objects, but i think it should give a sensible result with a 'GenomeDescription' object, so something may be going wrong with the coercion from 'BSgenome' to 'GenomeDescription'.
I'm dropping the notion of "release name" for BSgenome objects in BioC 3.12. For backward compatibility I just restored a temporary releaseName() method (in BSgenome 1.57.6) that does the following:
> releaseName(BSgenome.Hsapiens.UCSC.hg19)
[1] NA
Warning message:
Starting with Bioconductor 3.12, BSgenome objects no longer have a
"release name". As a consequence of this change calling releaseName()
on a BSgenome object now always returns NA and is deprecated.
ah ok, i see, this is fine, but the purpose of the 'releaseName()' function in the documentation is confusing, 'help(releaseName)' says: ?releaseName(x)?: Return the release name of this genome, which is ????????? generally made of the name of the organization who assembled ????????? it plus its Build version.? For example, UCSC uses ?"hg18"? ????????? for the version of the Human genome corresponding to the ????????? Build 36.1 from NCBI hence the release name for this genome ????????? is ?"NCBI Build 36.1"?. but, if i'm interpreting you correctly, genomes in BSgenome.* packages have no release name anymore. maybe the help page of 'releaseName', which is the one of the 'GenomeDescription' class should say only: 'releaseName(x)': Return the release name of 'x'. for whatever 'x' may be and whatever a release name may mean.
in the case of 'seqlevelsStyle()', while the change of style seems to happen, i don't see how the style can be simultaneously NCBI and UCSC.
This happens when you end up with a mix of styles. In the case of hg19,
all the sequences get renamed from UCSC to NCBI names **except** chrM:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
table(genome(txdb))
# hg19
# 93
seqlevelsStyle(txdb) <- "NCBI"
seqlevelsStyle(txdb)
# [1] "NCBI" "UCSC"
table(genome(txdb))
# GRCh37.p13 hg19
# 92 1
genome(txdb)[genome(txdb) == "hg19"]
# chrM
# "hg19"
chrM cannot and should not be renamed to MT because the MT sequence in
GRCh37.p13 is NOT the same as the chrM sequence in hg19. So the new
seqlevelsStyle() behavior fixes a long standing bug.
See previous discussion on this list about the long dance between UCSC
and NCBI about the mitochondrian sequence in hg19/GRCh37/GRCh37.p13:
https://stat.ethz.ch/pipermail/bioc-devel/2020-August/017086.html
Hope this helps,
H.
i see, this makes sense, i've followed the discussion about the
mitochondrian sequence naming, but didn't connect it with this behavior.
i'd like to set the sequence style of a TxDb object to whatever the
sequence style is of the BSgenome object, dropping the levels in the
TxDb object that are not common to the BSgenome object (removing the
mitochondrial chromosome in the case of hg19/GRCh37), so that the TxDb
object has a single sequence style, i thought this would work but i get
an error:
library(BSgenome.Hsapiens.1000genomes.hs37d5)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
bsgenome <- hs37d5
seqlevelsStyle(txdb) <- seqlevelsStyle(bsgenome)
Warning message:
In .normarg_seqlevelsStyle(value) :
? more than one seqlevels style supplied, using the 1st one only
seqlevelsStyle(txdb)
[1] "NCBI" "UCSC"
commonChr <- intersect(seqlevels(txdb), seqlevels(bsgenome))
commonChr
?[1] "1"? "2"? "3"? "4"? "5"? "6"? "7"? "8"? "9"? "10" "11" "12" "13"
"14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "X"? "Y"
txdb <- keepSeqlevels(txdb, commonChr)
seqlevelsStyle(txdb)
Error in .normarg_genome(value, seqnames(x)) :
? when length of supplied 'genome' vector is not 1, then it must equal
? the number of sequences
traceback()
9: stop(wmsg("when length of supplied 'genome' vector is not 1, ",
?????? "then it must equal the number of sequences"))
8: .normarg_genome(value, seqnames(x))
7: `genome<-`(`*tmp*`, value = x$user_genome)
6: `genome<-`(`*tmp*`, value = x$user_genome)
5: seqinfo(x)
4: seqinfo(x)
3: seqlevelsStyle(seqinfo(x))
2: seqlevelsStyle(txdb)
1: seqlevelsStyle(txdb)
i guess this should have worked, right?
thanks!
robert.
below my session information for the BioC devel 3.12 version. thanks!! robert. sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=en_US.UTF-8 ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=C ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4??? parallel? stats???? graphics? grDevices utils???? datasets [8] methods?? base other attached packages: ?[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 ?[2] GenomicFeatures_1.41.2 ?[3] AnnotationDbi_1.51.3 ?[4] Biobase_2.49.0 ?[5] BSgenome.Hsapiens.UCSC.hg19_1.4.3 ?[6] BSgenome_1.57.5 ?[7] rtracklayer_1.49.5 ?[8] Biostrings_2.57.2 ?[9] XVector_0.29.3 [10] GenomicRanges_1.41.6 [11] GenomeInfoDb_1.25.10 [12] IRanges_2.23.10 [13] S4Vectors_0.27.12 [14] BiocGenerics_0.35.4 [15] BiocManager_1.30.10 loaded via a namespace (and not attached): ?[1] SummarizedExperiment_1.19.6 progress_1.2.2 ?[3] tidyselect_1.1.0 purrr_0.3.4 ?[5] lattice_0.20-41 generics_0.0.2 ?[7] vctrs_0.3.1 BiocFileCache_1.13.1 ?[9] blob_1.2.1 XML_3.99-0.4 [11] rlang_0.4.6 pillar_1.4.4 [13] glue_1.4.1 DBI_1.1.0 [15] rappdirs_0.3.1 BiocParallel_1.23.2 [17] bit64_0.9-7.1 dbplyr_1.4.4 [19] matrixStats_0.56.0 GenomeInfoDbData_1.2.3 [21] lifecycle_0.2.0 stringr_1.4.0 [23] zlibbioc_1.35.0 memoise_1.1.0 [25] biomaRt_2.45.2 curl_4.3 [27] Rcpp_1.0.4.6 openssl_1.4.1 [29] DelayedArray_0.15.7 bit_1.1-15.2 [31] Rsamtools_2.5.3 hms_0.5.3 [33] askpass_1.1 digest_0.6.25 [35] stringi_1.4.6 dplyr_1.0.0 [37] grid_4.0.0 tools_4.0.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.98-1.2 RSQLite_2.2.0 [43] tibble_3.0.1 crayon_1.3.4 [45] pkgconfig_2.0.3 ellipsis_0.3.1 [47] Matrix_1.2-18 prettyunits_1.1.1 [49] assertthat_0.2.1 httr_1.4.1 [51] R6_2.4.1 GenomicAlignments_1.25.3 [53] compiler_4.0.0
15 days later
oops, sorry for letting this slip off my radar. see below...
On 9/3/20 05:22, Robert Castelo wrote:
Hi Herv?, On 02/09/2020 21:44, Pages, Herve wrote:
Hi Robert, On 9/2/20 11:12, Robert Castelo wrote:
hi, i've found the following glitches with 'releaseName()' and 'seqlevelsStyle()', i guess due to recent changes in BSgenome and GenomeInfoDb, which is why i'm cc'ing Herv? ;) ===========BioC release 3.11 everything OK============= library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) [1] "Genome Reference Consortium GRCh37.p13" releaseName(as(Hsapiens, "GenomeDescription")) [1] "Genome Reference Consortium GRCh37.p13" seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI"??? "Ensembl" head(seqlengths(txdb)) ? ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================= ============BioC devel 3.12 inconsistencies============ library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene releaseName(Hsapiens) Error in (function (classes, fdef, mtable)? : ? ? unable to find an inherited method for function ?releaseName? for signature ?"BSgenome"? releaseName(as(Hsapiens, "GenomeDescription")) [1] NA seqlevelsStyle(txdb) [1] "UCSC" seqlevelsStyle(txdb) <- "NCBI" seqlevelsStyle(txdb) [1] "NCBI" "UCSC" head(seqlengths(txdb)) ? ??????? 1???????? 2???????? 3???????? 4???????? 5 6 249250621 243199373 198022430 191154276 180915260 171115067 ======================================================== in the case of 'releaseName()' i interpret that the method has become defunct for 'BSgenome' objects, but i think it should give a sensible result with a 'GenomeDescription' object, so something may be going wrong with the coercion from 'BSgenome' to 'GenomeDescription'.
I'm dropping the notion of "release name" for BSgenome objects in BioC 3.12. For backward compatibility I just restored a temporary releaseName() method (in BSgenome 1.57.6) that does the following: ? > releaseName(BSgenome.Hsapiens.UCSC.hg19) [1] NA Warning message: ??? Starting with Bioconductor 3.12, BSgenome objects no longer have a ??? "release name". As a consequence of this change calling releaseName() ??? on a BSgenome object now always returns NA and is deprecated.
ah ok, i see, this is fine, but the purpose of the 'releaseName()' function in the documentation is confusing, 'help(releaseName)' says: ?releaseName(x)?: Return the release name of this genome, which is ????????? generally made of the name of the organization who assembled ????????? it plus its Build version.? For example, UCSC uses ?"hg18"? ????????? for the version of the Human genome corresponding to the ????????? Build 36.1 from NCBI hence the release name for this genome ????????? is ?"NCBI Build 36.1"?. but, if i'm interpreting you correctly, genomes in BSgenome.* packages have no release name anymore. maybe the help page of 'releaseName', which is the one of the 'GenomeDescription' class should say only: 'releaseName(x)': Return the release name of 'x'. for whatever 'x' may be and whatever a release name may mean.
Fair point. Since releaseName() will go away at some point (not just the
method for BSgenome objects but also other methods), I clarified the man
page:
releaseName(x):
IMPORTANT NOTE: The 'releaseName()' methods for GenomeDescription
and BSgenome objects are deprecated in BioC 3.12!
If 'x' is a GenomeDescription object, 'releaseName(x)'
returns the release name of this genome, which is usually made
of the name of the organization who assembled it plus its Build
version. For example, UCSC uses "hg18" for the version of
the Human genome corresponding to the Build 36.1 from NCBI hence
the release name for this genome is "NCBI Build 36.1".
If 'x' is a BSgenome object, releaseName(x) returns NA_character_
Hopefully this is clearer now.
in the case of 'seqlevelsStyle()', while the change of style seems to happen, i don't see how the style can be simultaneously NCBI and UCSC.
This happens when you end up with a mix of styles. In the case of hg19, all the sequences get renamed from UCSC to NCBI names **except** chrM: ??? library(TxDb.Hsapiens.UCSC.hg19.knownGene) ??? txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ??? table(genome(txdb)) ??? # hg19 ??? #?? 93 ??? seqlevelsStyle(txdb) <- "NCBI" ??? seqlevelsStyle(txdb) ??? # [1] "NCBI" "UCSC" ??? table(genome(txdb)) ??? # GRCh37.p13?????? hg19 ??? #???????? 92????????? 1 ??? genome(txdb)[genome(txdb) == "hg19"] ??? #?? chrM ??? # "hg19" chrM cannot and should not be renamed to MT because the MT sequence in GRCh37.p13 is NOT the same as the chrM sequence in hg19. So the new seqlevelsStyle() behavior fixes a long standing bug. See previous discussion on this list about the long dance between UCSC and NCBI about the mitochondrian sequence in hg19/GRCh37/GRCh37.p13: https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_pipermail_bioc-2Ddevel_2020-2DAugust_017086.html&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=SHfBpjQphuJri9AAl3GM3KfgVIi86pd7i6qNgogVwaY&s=7yB8LncH55x4hksYXNtFV6yAHAuupS-itRParAJTOUA&e= Hope this helps, H.
i see, this makes sense, i've followed the discussion about the
mitochondrian sequence naming, but didn't connect it with this behavior.
i'd like to set the sequence style of a TxDb object to whatever the
sequence style is of the BSgenome object, dropping the levels in the
TxDb object that are not common to the BSgenome object (removing the
mitochondrial chromosome in the case of hg19/GRCh37), so that the TxDb
object has a single sequence style, i thought this would work but i get
an error:
library(BSgenome.Hsapiens.1000genomes.hs37d5)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
bsgenome <- hs37d5
seqlevelsStyle(txdb) <- seqlevelsStyle(bsgenome)
Warning message:
In .normarg_seqlevelsStyle(value) :
? more than one seqlevels style supplied, using the 1st one only
seqlevelsStyle(txdb)
[1] "NCBI" "UCSC"
commonChr <- intersect(seqlevels(txdb), seqlevels(bsgenome))
commonChr
?[1] "1"? "2"? "3"? "4"? "5"? "6"? "7"? "8"? "9"? "10" "11" "12" "13"
"14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "X"? "Y"
txdb <- keepSeqlevels(txdb, commonChr)
seqlevelsStyle(txdb)
Error in .normarg_genome(value, seqnames(x)) :
? when length of supplied 'genome' vector is not 1, then it must equal
? the number of sequences
traceback()
9: stop(wmsg("when length of supplied 'genome' vector is not 1, ",
?????? "then it must equal the number of sequences"))
8: .normarg_genome(value, seqnames(x))
7: `genome<-`(`*tmp*`, value = x$user_genome)
6: `genome<-`(`*tmp*`, value = x$user_genome)
5: seqinfo(x)
4: seqinfo(x)
3: seqlevelsStyle(seqinfo(x))
2: seqlevelsStyle(txdb)
1: seqlevelsStyle(txdb)
i guess this should have worked, right?
Yes it should. Should work now with GenomicFeatures 1.41.3. Don't hesitate to let me know if you run into other rough edges like this and sorry for the inconvenience. Cheers, H.
thanks! robert.
below my session information for the BioC devel 3.12 version. thanks!! robert. sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so locale: ? ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ? ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=en_US.UTF-8 ? ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=C ? ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C ? ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4??? parallel? stats???? graphics? grDevices utils???? datasets [8] methods?? base other attached packages: ? ?[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 ? ?[2] GenomicFeatures_1.41.2 ? ?[3] AnnotationDbi_1.51.3 ? ?[4] Biobase_2.49.0 ? ?[5] BSgenome.Hsapiens.UCSC.hg19_1.4.3 ? ?[6] BSgenome_1.57.5 ? ?[7] rtracklayer_1.49.5 ? ?[8] Biostrings_2.57.2 ? ?[9] XVector_0.29.3 [10] GenomicRanges_1.41.6 [11] GenomeInfoDb_1.25.10 [12] IRanges_2.23.10 [13] S4Vectors_0.27.12 [14] BiocGenerics_0.35.4 [15] BiocManager_1.30.10 loaded via a namespace (and not attached): ? ?[1] SummarizedExperiment_1.19.6 progress_1.2.2 ? ?[3] tidyselect_1.1.0 purrr_0.3.4 ? ?[5] lattice_0.20-41 generics_0.0.2 ? ?[7] vctrs_0.3.1 BiocFileCache_1.13.1 ? ?[9] blob_1.2.1 XML_3.99-0.4 [11] rlang_0.4.6 pillar_1.4.4 [13] glue_1.4.1 DBI_1.1.0 [15] rappdirs_0.3.1 BiocParallel_1.23.2 [17] bit64_0.9-7.1 dbplyr_1.4.4 [19] matrixStats_0.56.0 GenomeInfoDbData_1.2.3 [21] lifecycle_0.2.0 stringr_1.4.0 [23] zlibbioc_1.35.0 memoise_1.1.0 [25] biomaRt_2.45.2 curl_4.3 [27] Rcpp_1.0.4.6 openssl_1.4.1 [29] DelayedArray_0.15.7 bit_1.1-15.2 [31] Rsamtools_2.5.3 hms_0.5.3 [33] askpass_1.1 digest_0.6.25 [35] stringi_1.4.6 dplyr_1.0.0 [37] grid_4.0.0 tools_4.0.0 [39] bitops_1.0-6 magrittr_1.5 [41] RCurl_1.98-1.2 RSQLite_2.2.0 [43] tibble_3.0.1 crayon_1.3.4 [45] pkgconfig_2.0.3 ellipsis_0.3.1 [47] Matrix_1.2-18 prettyunits_1.1.1 [49] assertthat_0.2.1 httr_1.4.1 [51] R6_2.4.1 GenomicAlignments_1.25.3 [53] compiler_4.0.0
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319