Skip to content

[Bioc-devel] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

6 messages · Arora, Sonali, Leonardo Collado Torres

#
Hello,

In my `derfinder` package I have been relying on `GenomeInfoDb` to
correctly name the sequence names. Given
https://support.bioconductor.org/p/62136/ I now realize that I was
expecting too much from `GenomeInfoDb`.

For example, in some cases I was expecting it to guess that a single
'2' meant that the naming style was NCBI. This is true for Homo
sapiens but now I realize that it's not necessary true for Arabidopsis
thaliana.
[1] "Arabidopsis thaliana" "NCBI"

## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
NCBI UCSC dbSNP
2    2 chr2   ch2
NCBI TAIR10
2    2      2

I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used. For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input.

## More than 1 output
2
[1,] "2"
[2,] "LGII"
## Turns out there is another organism that would make this mapping valid
NCBI JGI2.F
2 LGII      2


I also noticed that `GenomeInfoDb` couldn't work with some fictional
examples. For example:

## Expected error, but this meant that `derfinder` couldn't use toy
examples from other
## packages.
Error in seqlevelsStyle("seq1") :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style

## Using toy data from Rsamtools fails as expected
+                   mustWork=TRUE)
[1] "seq1" "seq2"
Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style


So, I wanted a way to use `GenomeInfoDb` to make the correct naming
style maps for supported genomes, and simply return the original
sequence names if using an unsupported species/mapping or if the
original and target styles are the same.


That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
long name) which you can see at
https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
might be interested in such a function, then by all means please add
it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
guess the species and/or current naming style when that information is
not provided. When it's used with verbose = TRUE it will show a
message explaining why the mapping failed (if it did), it'll show the
link to the `GenomeInfoDb` vignette for adding an organism, and then
return the original sequence names.

I have a slightly modified version in derfinder
https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
 or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
where I simply assume some other default values and guarantee
continuity with how I was doing things previously.

Naturally, R CMD check now shows a NOTE because I'm using the :::
syntax for un-exported functions from `GenomeInfoDb`.

Cheers,
Leo


Leonardo Collado Torres, PhD Candidate
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://lcolladotor.github.io/
#
Hi Leo,

To summarize, These are the concerns raised in your previous email :

*a) mapSeqlevels() should return current style along with other mapped 
styles? **
look at extendedMapSeqlevels()

*I am looking into this. I have a few thoughts/ questions for you.

The main idea of mapSeqlevels() is to map the current seqlevelStyle to 
other seqlevelStyles.

"I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used."


Returning the existing seqlevelStyle along with the mapped Seqlevel 
style sounds a
little redundant. (given that the user is already supplying it while 
calling mapSeqlevels()
so he already knows it!)

  "For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input."


why would you want to map it back to NCBI ? Please explain your use case 
more
concretely.

I'd be happy to make changes to mapSeqlevels() and look at your function 
more deeply,
once I understand the use case better. Thanks for clarifying this.

*b) Do you think it would be helpful to you and other developers of I 
export**
**.guessSpeciesStyle() and .supportedSeqnameMappings() from GenomeInfoDb ?*

*c) .guessSpeciesStyle('2') should return all possible hits for "seqnames"*

This was a great find! I have fixed it in version 1.3.2 - Thanks for 
figuring this out !
Does the following style of output make sense to you ? If multiple hits 
are found
there are returned in "species" and "style" respectively.

 > .guessSpeciesStyle(c(paste0("chr",1:10)))
$species
[1] "Homo sapiens" "Mus musculus"

$style
[1] "UCSC" "UCSC"

 > .guessSpeciesStyle(c(paste0("chr",1:22)))
$species
[1] "Homo sapiens"

$style
[1] "UCSC"

 > .guessSpeciesStyle("chr2")
$species
[1] "Homo sapiens" "Mus musculus"

$style
[1] "UCSC" "UCSC"

 > .guessSpeciesStyle("2")
$species
  [1] "Arabidopsis thaliana"    "Arabidopsis thaliana" "Cyanidioschyzon 
merolae"
  [4] "Homo sapiens"            "Mus musculus"            "Oryza sativa"
  [7] "Oryza sativa"            "Populus trichocarpa"     "Zea mays"
[10] "Zea mays"

$style
  [1] "NCBI"   "TAIR10" "NCBI"   "NCBI"   "NCBI"   "NCBI" "MSU6"   
"JGI2"   "NCBI"   "AGPvF"

*SeqlevelsStyle also returns multiple styles now (if it cant guess the 
correct one!) *

 > seqnames <- c(paste0("chr",1:22))
 > seqlevelsStyle(seqnames)
[1] "UCSC"

 > seqnames <- "2"
 > seqlevelsStyle(seqnames)
warning! Multiple seqnameStyles found.
[1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"

Thanks again for your feedback,
Sonali.

*
*
On 10/21/2014 8:47 PM, Maintainer wrote:

  
    
#
Hi Sonali,

The recent changes in GenomeInfoDb (1.3.3) look great!
On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora <sarora at fhcrc.org> wrote:
Totally!
I agree with you that my example doesn't make sense if a user is
manually calling mapSeqlevels().

In `derfinder`, several functions split the work by sequence name
(chromosomes). Others take two GRanges objects and match them. Others
create GRanges objects. Others load data from BAM or BigWig files.

In my use case, I wanted to make sure things would work even if the
user didn't make sure that (in the 2 GRanges case) that the objects
had the same sequence naming style. That is why the user doesn't
necessarily specify the `style` and I was using GenomeInfoDb to guess
it via `seqlevelsStyle()` as well as mapping the sequence names to the
default `style` to make sure the playing field was the same before
finding overlaps, etc via `mapSeqlevels()`.

This means that sometimes I am working with a single sequence name.
Hence the `seqlevelsStyle('2')` example which is a different scenario
from working with multiple sequence names.
warning! Multiple seqnameStyles found.
[1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"
[1] "NCBI"

I'm aware that I could simply dodge using mapSeqlevels() when the
default `style` is the same as the `style` currently used in the input
as shown below:
+     if(seqlevelsStyle(seqnames) != style) {
+         res <- mapSeqlevels(seqnames, style)
+     } else {
+         res <- seqnames
+     }
+     return(res)
+ }
## Works with multiple seqnames
1       2       3       4       5       6       7       8
 "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"
      9      10      11      12      13      14      15      16
 "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
     17      18      19      20      21      22
"chr17" "chr18" "chr19" "chr20" "chr21" "chr22"

## You also have to guess the species in using a small number of sequence names
## Or use only the first result: seqlevelsStyle(seqnames)[1]
warning! Multiple seqnameStyles found.
[1] "chr2"
Warning message:
In if (seqlevelsStyle(seqnames) != style) { :
  the condition has length > 1 and only the first element will be used
The second part (if you can call it like that) of my use case is that
I wanted to use the same code whether or not the sequence names /
organism supplied are not supported by GenomeInfoDb. That is, I wanted
to be flexible enough to use all the power from `GenomeInfoDb` when
there's enough information, and simply ignore the step of renaming
sequence styles if the information is missing.
`extendedMapSeqlevels()` can do this.

Alternatively, I could also use some tryCatch() calls and if I get an
error (as illustrated below) then simply avoid using `mapSeqlevels()`.

## Could also be '39' instead of 'seq1' if for some reason someone is
working with a genome that has 39 chromosomes.
## Currently the supported organism in GenomeInfoDb with the highest
number is Canis familiaris with 38 autosomal chrs.
Error in seqlevelsStyle("seq1") :
  The style does not have a compatible entry for the species
  supported by Seqname. Please see genomeStyles() for
  supported species/style

## Modified foo() which can work with unsupported genomes/styles
+     currentStyle <- tryCatch(seqlevelsStyle(seqnames), error =
function(e) { 'error' })
+     if(currentStyle[1] == 'error') {
+         return(seqnames)
+     } else {
+         ## Or use currentStyle[1] if you don't want the warning in zoo('2')
+         if(currentStyle != style)     {
+             res <- mapSeqlevels(seqnames, style)
+         } else {
+             res<- seqnames
+         }
+     }
+     return(seqnames)
+ }
[1] "seq1"

`extendedMapSeqlevels()` does more than this by guessing the current
style only for valid species. If the species is not supplied, it
guesses it as well.

`zoo()` above can be modified to take a `species` argument and/or
guess it by using `.guessSpeciesStyle()`.
No problem =) Thanks for digesting my email!
Say if `mapSeqlevels()` had a `species` argument and
`guessSpeciesStyle()` was exported I could simplify
`extendedMapSeqlevels()` [maybe a different name for it would be
better] for my use case.

I think that guessing the correct `style` without the `species`
information is going to get harder as the number of supported
organisms increases in `GenomeInfoDb`. Hence why I like
`guessSpeciesStyle()` more than `seqlevelsStyle()`.

Basically, I would use `guessSpeciesStyle()` to restrict the valid
`style` guesses to those corresponding to the `species` of interest,
or guess everything if nothing is supplied. Then with `mapSeqlevels()`
I could specify which mapping to use. Or alternatively use
`supportedSeqnameMappings()` if you think that `mapSeqlevels()`
shouldn't have a species argument.

Anyhow, I don't think that it's critical that you export these
functions. I don't think that there is a problem with having the
single NOTE in R CMD check.
Awesome! I think that these changes are great! I'll shortly modify my
code to use this version.
=)
Let me know if you need anything else!

Cheers,
Leo
#
On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres
<lcollado at jhu.edu> wrote:
I'm guessing that the main use of `seqlevelsStyle(x)` is when `x` is a
GRanges or GRangesList object that has the organism information. Then
the `species` can be inferred without error and making it much more
straight forward to guess the naming style used.
#
Hi,

I just found a tiny bug in .guessSpeciesStyle(). Basically, if the
style name has a dot, this function incorrectly returns the name of
the style.

See below:

## Lets guess the species and style for 'T'
$species
[1] "Populus trichocarpa"

$style
[1] "JGI2"

## Same style with the higher lvl function
[1] "JGI2"

## Ok, style 'JGI2'

## In more dtail, this matches 'LGI' in NCBI style for this organism
$Populus_trichocarpa
     NCBI JGI2.F
1     LGI      T
2    LGII      2
3   LGIII      3
4    LGIV      4
5     LGV      5
6    LGVI      6
7   LGVII      7
8  LGVIII      8
9    LGIX      9
10    LGX     10
11   LGXI     11
12  LGXII     12
13 LGXIII     13
14   LGIV     14
15   LGXV     15
16   LGVI     16
17  LGVII     17
18 LGVIII     18
19  LGXIX     19
20   Pltd   <NA>

## Guessing 'LGI' with the higher lvl function works
[1] "NCBI"

## Mapping from 'LGI' in 'NCBI' style to 'JGI2' style doesn't work
Error in mapSeqlevels("LGI", "JGI2") :
  supplied seqname style "JGI2" is not supported

## Using the correct 'style' name works.
## However a user might have expected it to work with 'JGI2' given
## that this was the output from seqlevelsStyle('T')
[1] "T"

## Although a user could find the correct name
circular auto   sex NCBI JGI2.F
1    FALSE TRUE FALSE  LGI      T
[1] ?1.3.3?

Cheers,
Leo

On Wed, Oct 22, 2014 at 7:03 PM, Leonardo Collado Torres
<lcollado at jhu.edu> wrote:
#
Hi Leo,

This bug is now fixed .
JGI2 was being incorrectly returned previously.
The correct style is JGI2.F and from now on, it will always check only 
for JGI2.F

 > library(GenomeInfoDb)
 > seqlevelsStyle("T")
[1] "JGI2.F"
 > seqlevelsStyle("LGI")
[1] "NCBI"
 > mapSeqlevels('LGI', 'JGI2.F')
[1] "T"
 > packageVersion("GenomeInfoDb")
[1] ?1.3.4?

Thanks for pointing these out.
Best,
Sonali.
On 10/23/2014 3:35 PM, Leonardo Collado Torres wrote: