Skip to content

[Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

11 messages · Hervé Pagès, Michael Lawrence, Rainer Johannes +2 more

#
I agree, I would not modify the file content. At present it is however not possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the TwoBitFile. I don?t know if a seqnames or seqinfo method stripping of all but the first name-part would help here...

jo
#
Hi Jo, Michael,

What about implementing a seqlevels() setter for TwoBitFile objects? All
you need for this is an extra slot for storing the user-supplied
seqlevels. Note that in general the seqlevels() setter allows more than
renaming the seqlevels. It also allows dropping, adding, and shuffling
them. But you don't need to support all that. Supporting renaming would
already go a long way. See selectMethod("seqlevels<-", "TxDb") in
GenomicFeatures for an example of a restricted "seqlevels<-" method.

H.
On 01/08/2016 09:50 AM, Rainer Johannes wrote:

  
    
#
That is one solution. But everyone using that genome would need to
reset the seqlevels to the "standard" ones. In this specific case, is
there any reason not to just use the BSgenome for GRCh38?
On Fri, Jan 8, 2016 at 11:04 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:
#
On 01/08/2016 01:09 PM, Michael Lawrence wrote:
I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
use case. Just wanted to mention that the ability to rename the
sequences in a TwoBitFile, FastaFile, or other file-based object that
supports seqinfo() would be useful in general.

H.

  
    
#
Yes, using BSGenome would help in this case. 
In the long run I think it might be important to have this fixed, not necessarily for human, but for other species/genome builds for which there might not be an BSGenome package available; through AnnotationHub all GTF files and fasta files would be available. Note also that the FaFiles from Ensembl do have the ?correct? chromosome names although I assume they were built from the same Ensembl fasta files than the TwoBitFiles.

jo
#
We switched to TwoBitFile with a recent ensembl release, thinking that it had better performance and other characteristics compared to the previous FaFile.

The 'recipe' used to create the FaFiles did not explicitly trim the label; that appears to be something done by Rsamtools::indexFa and hence (a now quite dated) version of samtools.

I'm not precisely sure where we stand on correcting this. The original approach just takes what we're given and makes a 2bit file. At least provisionally we had decided (after Thurs / Fri exchanges) to make the seqlevels sensible on the way in to annotation hub; this is against Sean's advice and I'm not really a big fan of this. 

I like the idea of being able to dynamically remap the seqlevels when the 2bit file is loaded by AnnotationHub, which would require Herve's suggestion of settable seqlevels on TwoBitFile.

Martin
#
I can understand the desire to avoid defining and enforcing our own
standards on third-party data: it's error-prone, potentially
confusing, etc. But the same is even more true of expecting the user
to perform the mapping via some adhoc approach.

It's unfortunate that Ensembl does not follow the convention of naming
their FASTA sequences by their seqlevels, but I'm not sure how
wide-spread that convention is in the first place.

Why does Bioconductor distribute genomes in two different ways:
BSgenome and via AnnotationHub? Couldn't those two distribution
mechanisms be unified? That might mitigate some of the maintenance
cost and better encapsulate the added complexity.

Michael

On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
<Martin.Morgan at roswellpark.org> wrote:
#
Also things like organismdbi don't seem to exist for organisms other than human, mouse, rat.  So if you want to use that infrastructure for fly or worms, you're SOL at the moment. 

This is a highly topical discussion since many/most microarray probes can be profitably (in terms of knowledge, not money) remapped to more contemporary or richer transcriptomes and thus used to explore the generality of findings.  The OrganismDb/BsGenome infrastructure doesn't well accommodate this use case, yet, but Zhilong's recent remarks suggest that a unified approach could be broadly useful for many investigators. 

Being a lazy bum, I tried to dump the task back on him (no good deed goes unpunished) but since Jo is also a glutton for punishment and an author of fine Ensembl support packages...

:-)

In all seriousness the generosity of the BioC community cannot be overstated. You guys are great

--t
#
On 01/09/2016 08:42 AM, Michael Lawrence wrote:
Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
it seems that:

   The word following the ">" symbol is the identifier of the sequence,
   and the rest of the line is the description (both are optional).

so Rsamtools::indexFa is doing the right thing by trimming the
description line. Maybe that's what seqinfo,TwoBitFile should do
too?

H.

  
    
#
That would be great! I don?t think we would loose any information with that behaviour. All fasta files from Ensembl have that format (id, whitespace and description). Implementing that would render the fasta files from Ensembl provided as TwoBit files via AnnotationHub usable without having to tweak the objects afterwards. I would highly appreciate that (and people working with other species than mouse/human/rat too)!

jo
#
Yes, thanks for those details, Herve. I changed rtracklayer to take
the first word as the seqlevels.

Michael

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes
<Johannes.Rainer at eurac.edu> wrote: