[Bioc-devel] proposal for additional seqlevelsStyle

Hi Vince, Robert, Kasper,

I've done some work on this. Starting with GenomeInfoDb_1.25.7 the 
seqlevelsStyle() setter has 2 major improvements:

1. It knows how to rename contigs and scaffolds, not just the chromosomes:

   library(TxDb.Mmusculus.UCSC.mm10.knownGene)
   txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene

   seqinfo(txdb)
   # Seqinfo object with 66 sequences (1 circular) from mm10 genome:
   # seqnames       seqlengths isCircular genome
   # chr1            195471971       <NA>   mm10
   # chr2            182113224       <NA>   mm10
   # chr3            160039680       <NA>   mm10
   # chr4            156508116       <NA>   mm10
   # chr5            151834684       <NA>   mm10
   # ...                   ...        ...    ...
   # chrUn_GL456392      23629       <NA>   mm10
   # chrUn_GL456393      55711       <NA>   mm10
   # chrUn_GL456394      24323       <NA>   mm10
   # chrUn_GL456396      21240       <NA>   mm10
   # chrUn_JH584304     114452       <NA>   mm10

   seqlevelsStyle(txdb) <- "NCBI"

   seqinfo(txdb)
   # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
   # seqnames      seqlengths isCircular genome
   # 1              195471971       <NA> GRCm38
   # 2              182113224       <NA> GRCm38
   # 3              160039680       <NA> GRCm38
   # 4              156508116       <NA> GRCm38
   # 5              151834684       <NA> GRCm38
   # ...                  ...        ...    ...
   # MSCHRUN_CTG10      23629       <NA> GRCm38
   # MSCHRUN_CTG11      55711       <NA> GRCm38
   # MSCHRUN_CTG12      24323       <NA> GRCm38
   # MSCHRUN_CTG15      21240       <NA> GRCm38
   # MSCHRUN_CTG23     114452       <NA> GRCm38

2. It supports a new style, RefSeq, for renaming to/from RefSeq accessions:

   seqlevelsStyle(txdb) <- "RefSeq"

   seqinfo(txdb)
   # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
   # seqnames    seqlengths isCircular genome
   # NC_000067.6  195471971       <NA> GRCm38
   # NC_000068.7  182113224       <NA> GRCm38
   # NC_000069.6  160039680       <NA> GRCm38
   # NC_000070.6  156508116       <NA> GRCm38
   # NC_000071.6  151834684       <NA> GRCm38
   # ...                ...        ...    ...
   # NT_166476.1      23629       <NA> GRCm38
   # NT_166477.1      55711       <NA> GRCm38
   # NT_166478.1      24323       <NA> GRCm38
   # NT_166480.1      21240       <NA> GRCm38
   # NT_187064.1     114452       <NA> GRCm38

These new features only work on objects for which the genome is set to 
an NCBI assembly (e.g. WBcel235) or UCSC genome (e.g. ce11). This is the 
case with TxDb, BSgenome, and SNPlocs objects.
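
To make the genome requirement concrete, here is a minimal sketch. It 
assumes that a plain GRanges object qualifies once its genome is set by 
hand; I haven't listed GRanges above, so treat that part as an 
assumption. The range used is made up for illustration, and the setter 
call needs internet access.

```r
library(GenomicRanges)   # also attaches GenomeInfoDb

## A toy range (made up for illustration). It starts with no genome.
gr <- GRanges("chr1:1-100")
genome(gr)               # NA: the new renaming features have nothing to go on

## Set the genome to a UCSC genome name so the setter can look it up.
genome(gr) <- "mm10"

## ASSUMPTION: with the genome set, the setter should accept the new
## RefSeq style on a GRanges just as it does on a TxDb object.
## Requires internet access.
seqlevelsStyle(gr) <- "RefSeq"
seqinfo(gr)
```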

The workhorses behind them are two new low-level utilities, 
getChromInfoFromNCBI() and getChromInfoFromUCSC(). At the moment they 
support 141 NCBI assemblies and 74 UCSC genomes, respectively, and it's 
easy to add new organisms. The gotcha is that they require internet 
access, and therefore so does the seqlevelsStyle() setter. This could be 
mitigated by caching the data via BiocFileCache.
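
As a rough sketch of what such caching could look like (this is not 
implemented in GenomeInfoDb): the helper cachedChromInfo() below is 
hypothetical, as are its arguments and the "chrominfo_" resource naming 
scheme. It stores the data.frame returned by the fetch function as an 
RDS file in a BiocFileCache and reuses it on later calls.

```r
library(GenomeInfoDb)
library(BiocFileCache)

## Hypothetical helper, NOT part of GenomeInfoDb. 'fetch' is any function
## that takes a genome name and returns a data.frame, e.g.
## getChromInfoFromUCSC or getChromInfoFromNCBI.
cachedChromInfo <- function(genome,
                            fetch = getChromInfoFromUCSC,
                            bfc = BiocFileCache(ask = FALSE))
{
    rname <- paste0("chrominfo_", genome)   # made-up naming scheme

    ## bfcquery() does pattern matching, so keep exact matches only.
    hits <- bfcquery(bfc, rname, field = "rname")
    hits <- hits[hits$rname == rname, ]
    if (nrow(hits) > 0L)
        return(readRDS(hits$rpath[[1L]]))   # cache hit: no internet needed

    ## Cache miss: fetch once (needs internet), then store for next time.
    chrominfo <- fetch(genome)
    saveRDS(chrominfo, bfcnew(bfc, rname = rname, ext = ".rds"))
    chrominfo
}
```

With this, cachedChromInfo("mm10") would hit the network only on the 
first call. Note that it only caches direct lookups; the 
seqlevelsStyle() setter itself would still need internet access until 
something like this is wired into the package.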

Next thing on the list is to support the GenBank style (Vince's original 
request) to rename to/from GenBank accessions.

Cheers,
H.
On 12/13/19 10:51, Kasper Daniel Hansen wrote: