Skip to content

[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

22 messages · Vincent Carey, Tim Triche, Jr., Hervé Pagès +3 more

#
Right, I typically do that too, and if you're working on human data it
isn't a big deal.  What makes things a lot more of a drag is when you work
on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
is essentially a "build ahead" of Homo.sapiens.

R> seqinfo(Homo.sapiens)
Seqinfo object with 93 sequences (1 circular) from hg19 genome

R> seqinfo(Mus.musculus)
Seqinfo object with 66 sequences (1 circular) from mm10 genome:

It's not as explicit as directly assigning the seqinfo from a genome that
corresponds to that of your annotations/results/whatever.  I know we could
all use crossmap or liftOver or whatever, but that's not really the same,
and it takes time, whereas assigning the proper seqinfo for relationships
is very fast.

That's all I was getting at...


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

  
  
#
It really isn't hard to have multiple OrganismDb packages in place -- the
process of making new ones is documented and was given as an exercise in
the EdX course.  I don't know if we want to institutionalize it and
distribute such -- I think we might, so that there would be Hs19, Hs38,
mm9, etc. packages.  They have very little content, they just coordinate
interactions with packages that you'll already have.

On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

  
  
#
That would be perfect actually.  And it would radically reduce & modularize maintenance.  Maybe that's the best way to go after all.  Quite sensible. 

--t
1 day later
#
Hi,

FWIW I started to work on supporting quick generation of a standalone
Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.

It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
and sacCer2. I'll add more.

See ?Seqinfo for some examples.

Right now it fetches the information from internet every time you
call it but maybe we should just store that information in the
GenomeInfoDb package as Tim suggested?

H.
On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:

  
    
#
Herve,

This sounds great.  I do think it would be highly advantageous to not use
the internet.  For example, you could imagine using this function everytime
a Granges() is created, and it would be useful not to have internet access
for this.

Best,
Kasper
On Thu, Jun 4, 2015 at 7:28 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:

            

  
  
#
Maybe this could eventually support setting the seqinfo with:

genome(gr) <- "hg19"

Or is that being too clever?
On Thu, Jun 4, 2015 at 4:28 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:
#
that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:

  
  
#
I also think that we're heading towards something like that.

So genome(gr) <- "hg19" would:

   (a) Add any missing information to the seqinfo.
   (b) Sort the seqlevels in "canonical" order.
   (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
   gr1 <- GRanges(seqinfo=si1)
   hg19_si <- Seqinfo(genome="hg19")

   ## (a):
   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames       seqlengths isCircular genome
   #   chrX            155270560      FALSE   hg19
   #   chrUn_gl000249      38502      FALSE   hg19
   #   chr2            243199373      FALSE   hg19
   #   chr6_cox_hap2     4795371      FALSE   hg19

   ## (b):
   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames       seqlengths isCircular genome
   #   chr2            243199373      FALSE   hg19
   #   chrX            155270560      FALSE   hg19
   #   chr6_cox_hap2     4795371      FALSE   hg19
   #   chrUn_gl000249      38502      FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", 
"HSCHR6_MHC_COX_CTG1"))
   gr2 <- GRanges(seqinfo=si2)

   seqlevelsStyle(gr2)
   # [1] "NCBI"

   seqlevelsStyle(gr2) <- "UCSC"
   seqlevels(gr2)
   # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2" 

   # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this? What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.

H.
On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:

  
    
#
On Thu, Jun 4, 2015 at 11:48 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:
Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.
Yea, but "genome" is so intuitive compared to "seqinfo".
#
On 06/05/2015 11:43 AM, Michael Lawrence wrote:
Agreed but my worry is that when they don't, then they would be left
with no way to just set the genome column.

H.

  
    
#
Herve,

This is probably a naive question, but what usecases are there for creating
an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <lawrence.michael at gene.com

  
    
#
In WGBS we frequently sequence a human with spikein from the lambda
genome.  In this case, most of the chromosomes of the Granges are from
human, except one.  This is a usecase where genome(GR) is not constant.  I
suggest, partly for compatibility, to keep genome, but perhaps do something
like
  standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans it up.

Kasper
On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.gabe at gene.com> wrote:

            

  
  
#
That sounds like it calls for an (class-style) inheritence/genome-union
 model to me. I should probably stop talking now before the people who
would have to implement that start throwing things at me, though.

~G

On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

            

  
    
#
Hi Gabe,

I wouldn't say "wrong" but "not normalized". So maybe for example
everything is ok with their seqinfo but the seqlevels are not in
canonical order, and, for whatever reason, they want to keep them
that way (the actual order of the seqlevels impacts the sorting of
the ranges in the GRanges object). If it turns out that the genome
column is empty and they want to set it, then they would be left
with no way to do so.

Another concern I have is that genome<- has been the low-level setter
for the genome column for a while and changing its behavior now might
break some existing code. It's fixable, but we can avoid that.

H.
On 06/05/2015 11:51 AM, Gabe Becker wrote:

  
    
#
To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.

But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
genome(x)" would make sense.

I don't think people sort too often by seqnames (except to the
"natural" ordering), but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a parameter
for disabling the standardization.

On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
#
On Fri, Jun 5, 2015 at 1:19 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
That is one way. Another would be it takes a vector of the length of
genomes being unioned, and each genome knows it's seqinfo and fixes things
within it's domain. The seqlevels, for example, would be
c(seqlevels(genome1), ..., seqlevels(genomeK)). There shouldn't be overlap,
because if there is there was already an identifiability problem in the
data which is basically guaranteed to be an error (I think).

If combinations of genomes were more formally modeled, though, you could do
fun things like

genome(x, strict=FALSE) = "GRCh38"

Which would do nothing if the genome on x was already a union containing
GRCh38, and otherwise would fix the "human part" of the seqinfo to be
GRCh37, but would leave anything that had been unioned on alone.

~G

  
    
#
On 06/05/2015 01:19 PM, Michael Lawrence wrote:
That's what sort(), order(), rank() do by default: they sort by seqnames
first, then by start, then by end, and finally by strand.
Ok. So the candidates are:

  (a) standardizeSeqinfo(gr) <- "hg19"

  (b) gr <- standardizeSeqinfo(gr, "hg19")

  (c) standardizeGenome(gr) <- "hg19"

  (d) gr <- standardizeGenome(gr, "hg19")

  (e) seqinfo(gr) <- "hg19"

Is there a risk of confusion with keepStandardChromosomes where
"standard" means a very different thing? I'll add 2 more:

  (f) normalizeSeqinfo(gr) <- "hg19"

  (g) gr <- normalizeSeqinfo(gr, "hg19")

Anyway, we're not here yet. As pointed in an earlier post, there are
still some missing pieces to complete the puzzle.

Thanks,
H.

  
    
#
how about just

gr <- addSeqinfo(gr, "hg19")

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Fri, Jun 5, 2015 at 1:36 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:

            

  
  
#
On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:
Add sounds like it's, well, adding rather than replacing (Which it
sometimes would do.

gr <- fixSeqInfo(gr, "hg19")

instead?

~G
#
On 06/05/2015 01:39 PM, Tim Triche, Jr. wrote:
mmh, we don't really "add" a seqinfo. We transform the existing one.
This transformation can be called "standardization", or "normalization",
or... but I wouldn't call it "addition".

H.

  
    
#
On 06/05/2015 01:41 PM, Gabe Becker wrote:
As I tried to explain earlier, it's not broken.

Come on, we've already a list of 7 proposals! I was hoping you guys
would help pick up one instead of growing the list :-b

H.

  
    
#
Not a big fan of the gets (<-) syntax for "verb" functions.
standardizeSeqinfo() is probably the most "proper" name according to
our nomenclature. But novice users are still just going to want to do
genome(gr) <- "hg19" and have it just the right thing.
On Fri, Jun 5, 2015 at 1:36 PM, Herv? Pag?s <hpages at fredhutch.org> wrote: