[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Fri, Jun 5, 2015 1:46 PM

Not a big fan of the gets (<-) syntax for "verb" functions.
standardizeSeqinfo() is probably the most "proper" name according to
our nomenclature. But novice users are still just going to want to do
genome(gr) <- "hg19" and have it just the right thing.

On Fri, Jun 5, 2015 at 1:36 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:

On 06/05/2015 01:19 PM, Michael Lawrence wrote:

To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.

But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
genome(x)" would make sense.

I don't think people sort too often by seqnames (except to the
"natural" ordering),


That's what sort(), order(), rank() do by default: they sort by seqnames
first, then by start, then by end, and finally by strand.

but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a parameter
for disabling the standardization.


Ok. So the candidates are:

 (a) standardizeSeqinfo(gr) <- "hg19"

 (b) gr <- standardizeSeqinfo(gr, "hg19")

 (c) standardizeGenome(gr) <- "hg19"

 (d) gr <- standardizeGenome(gr, "hg19")

 (e) seqinfo(gr) <- "hg19"

Is there a risk of confusion with keepStandardChromosomes where
"standard" means a very different thing? I'll add 2 more:

 (f) normalizeSeqinfo(gr) <- "hg19"

 (g) gr <- normalizeSeqinfo(gr, "hg19")

Anyway, we're not here yet. As pointed in an earlier post, there are
still some missing pieces to complete the puzzle.

Thanks,

H.

On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:

In WGBS we frequently sequence a human with spikein from the lambda
genome.
In this case, most of the chromosomes of the Granges are from human,
except
one.  This is a usecase where genome(GR) is not constant.  I suggest,
partly
for compatibility, to keep genome, but perhaps do something like
   standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans it up.

Kasper

On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.gabe at gene.com> wrote:


Herve,

This is probably a naive question, but what usecases are there for
creating
an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
<lawrence.michael at gene.com

wrote:

On Thu, Jun 4, 2015 at 11:48 PM, Herv? Pag?s <hpages at fredhutch.org>
wrote:

I also think that we're heading towards something like that.

So genome(gr) <- "hg19" would:

   (a) Add any missing information to the seqinfo.
   (b) Sort the seqlevels in "canonical" order.
   (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
"chr6_cox_hap2"))
   gr1 <- GRanges(seqinfo=si1)
   hg19_si <- Seqinfo(genome="hg19")

   ## (a):
   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames       seqlengths isCircular genome
   #   chrX            155270560      FALSE   hg19
   #   chrUn_gl000249      38502      FALSE   hg19
   #   chr2            243199373      FALSE   hg19
   #   chr6_cox_hap2     4795371      FALSE   hg19

   ## (b):
   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames       seqlengths isCircular genome
   #   chr2            243199373      FALSE   hg19
   #   chrX            155270560      FALSE   hg19
   #   chr6_cox_hap2     4795371      FALSE   hg19
   #   chrUn_gl000249      38502      FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",

"HSCHR6_MHC_COX_CTG1"))

   gr2 <- GRanges(seqinfo=si2)

   seqlevelsStyle(gr2)
   # [1] "NCBI"

   seqlevelsStyle(gr2) <- "UCSC"
   seqlevels(gr2)
   # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2"
   # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this?


Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.

What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.

Yea, but "genome" is so intuitive compared to "seqinfo".

H.

On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:


that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:

     Maybe this could eventually support setting the seqinfo with:

     genome(gr) <- "hg19"

     Or is that being too clever?

     On Thu, Jun 4, 2015 at 4:28 PM, Herv? Pag?s
<hpages at fredhutch.org
     <mailto:hpages at fredhutch.org>> wrote:

      > Hi,
      >
      > FWIW I started to work on supporting quick generation of a

standalone

      > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
      >
      > It already supports hg38, hg19, hg18, panTro4, panTro3,

panTro2,

      > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1,

musFur1,
mm10,

      > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,

galGal3,

      > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,

sacCer3,

      > and sacCer2. I'll add more.
      >
      > See ?Seqinfo for some examples.
      >
      > Right now it fetches the information from internet every time

you

      > call it but maybe we should just store that information in

the

      > GenomeInfoDb package as Tim suggested?
      >
      > H.
      >
      >
      > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:

      >>
      >> That would be perfect actually.  And it would radically

reduce &

      >> modularize maintenance.  Maybe that's the best way to go

after
     all.  Quite

      >> sensible.
      >>
      >> --t
      >>

      >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey

     <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>

      >>> wrote:
      >>>
      >>> It really isn't hard to have multiple OrganismDb packages

in
     place -- the

      >>> process of making new ones is documented and was given as

an
     exercise in

      >>> the EdX course.  I don't know if we want to

institutionalize
it
and

      >>> distribute such -- I think we might, so that there would be

     Hs19, Hs38,

      >>> mm9, etc. packages.  They have very little content, they

just
     coordinate

      >>> interactions with packages that you'll already have.
      >>>
      >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.

     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>

      >>> wrote:
      >>>

      >>>> Right, I typically do that too, and if you're working on

human
     data it

      >>>> isn't a big deal.  What makes things a lot more of a drag

is
     when you

      >>>> work
      >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38)

where

      >>>> Mus.musculus
      >>>> is essentially a "build ahead" of Homo.sapiens.
      >>>>
      >>>> R> seqinfo(Homo.sapiens)
      >>>> Seqinfo object with 93 sequences (1 circular) from hg19

genome

      >>>>
      >>>> R> seqinfo(Mus.musculus)
      >>>> Seqinfo object with 66 sequences (1 circular) from mm10

genome:

      >>>>
      >>>> It's not as explicit as directly assigning the seqinfo

from
a
     genome

      >>>> that
      >>>> corresponds to that of your annotations/results/whatever.

I
     know we

      >>>> could
      >>>> all use crossmap or liftOver or whatever, but that's not

     really the

      >>>> same,
      >>>> and it takes time, whereas assigning the proper seqinfo

for

      >>>> relationships
      >>>> is very fast.
      >>>>
      >>>> That's all I was getting at...
      >>>>
      >>>>
      >>>> Statistics is the grammar of science.
      >>>> Karl Pearson

<http://en.wikipedia.org/wiki/The_Grammar_of_Science>

      >>>>
      >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
      >>>> <stvjc at channing.harvard.edu <mailto:

stvjc at channing.harvard.edu>

      >>>>>
      >>>>> wrote:

      >>>>
      >>>>

      >>>>> I typically get this info from Homo.sapiens.  The result

is
     parasitic

      >>>>> on
      >>>>> the TxDb that is in there.  I don't know how easy it is

to

swap

      >>>>> alternate
      >>>>> TxDb in to get a different build.  I think it would make

sense

to

      >>>>> regard
      >>>>> the OrganismDb instances as foundational for this sort of

     structural

      >>>>> data.
      >>>>>
      >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
      >>>>> kasperdanielhansen at gmail.com

     <mailto:kasperdanielhansen at gmail.com>> wrote:

      >>>>>

      >>>>>> Let me rephrase this slightly.  From one POV the purpose

of

      >>>>>> GenomeInfoDb
      >>>>>> is
      >>>>>> clean up the seqinfo slot.  Currently it does most of

the
     cleaning,

      >>>>>> but
      >>>>>> it
      >>>>>> does not add seqlengths.
      >>>>>>
      >>>>>> It is clear that seqlengths depends on the version of

the
     genome, but

      >>>>>> I
      >>>>>> will argue so does the seqnames.  Of course, for human,

     chr22 will not

      >>>>>> change.  But what about the names of all the random

     contigs?  Or for

      >>>>>> other
      >>>>>> organisms, what about going from a draft genome with 10k

     contigs to a

      >>>>>> more
      >>>>>> completely genome assembled into fewer, larger

chromosomes.

      >>>>>>
      >>>>>> I acknowledge that this information is present in the

BSgenome

      >>>>>> packages,
      >>>>>> but it seems (to me) to be very appropriate to have them

     around for

      >>>>>> cleaning up the seqinfo slot.  For some situations it is

not
     great to

      >>>>>> depend on 1 GB> download for something that is a few

bytes.

      >>>>>>
      >>>>>> Best,
      >>>>>> Kasper
      >>>>>>
      >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.

     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>

      >>>>>> wrote:
      >>>>>>

      >>>>>>> It would be nice (for a number of reasons) to have

     chromosome lengths

      >>>>>>> readily available in a foundational package like

     GenomeInfoDb, so

      >>>>>>> that,
      >>>>>>> say,
      >>>>>>>
      >>>>>>> data(seqinfo.hg19)
      >>>>>>> seqinfo(myResults) <- seqinfo.hg19[

seqlevels(myResults)
]

      >>>>>>>
      >>>>>>> would work without issues.  Is there any particular

reason
this

      >>>>>>
      >>>>>> couldn't

      >>>>>>>
      >>>>>>> happen for the supported/available BSgenomes?  It would

     seem like a

      >>>>>>
      >>>>>> simple

      >>>>>>>
      >>>>>>> matter to do
      >>>>>>>
      >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
      >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
      >>>>>>> R> save(seqinfo.hg19,
      >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
      >>>>>>>
      >>>>>>> and be done with it until (say) the next release or

next
     released

      >>>>>>> BSgenome.  I considered looping through the following

BSgenomes

      >>>>>>
      >>>>>> myself...

      >>>>>>>
      >>>>>>> and if it isn't strongly opposed by (everyone) I may

still
     do exactly

      >>>>>>> that.  Seems useful, no?
      >>>>>>>
      >>>>>>> e.g. for the following 42 builds,
      >>>>>>>
      >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",

     available.genomes())),

      >>>>>>> value=TRUE)
      >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"

      >>>>>>
      >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"

      >>>>>>>
      >>>>>>>
      >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"

      >>>>>>
      >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"

      >>>>>>>
      >>>>>>>
      >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
      >>>>>>> "BSgenome.Celegans.UCSC.ce10"
      >>>>>>>
      >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"

       "BSgenome.Celegans.UCSC.ce6"

      >>>>>>>
      >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
      >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
      >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
      >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
      >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"

      >>>>>>
      >>>>>> "BSgenome.Drerio.UCSC.danRer5"

      >>>>>>>
      >>>>>>>
      >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"

      >>>>>>
      >>>>>> "BSgenome.Drerio.UCSC.danRer7"

      >>>>>>>
      >>>>>>>
      >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
      >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
      >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"

      >>>>>>
      >>>>>> "BSgenome.Ggallus.UCSC.galGal4"

      >>>>>>>
      >>>>>>>
      >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
      >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
      >>>>>>>
      >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
      >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
      >>>>>>>
      >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
      >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
      >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"

      >>>>>>
      >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"

      >>>>>>>
      >>>>>>>
      >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"

      >>>>>>
      >>>>>> "BSgenome.Mmusculus.UCSC.mm10"

      >>>>>>>
      >>>>>>>
      >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
      >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
      >>>>>>>
      >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
      >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
      >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"

      >>>>>>
      >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"

      >>>>>>>
      >>>>>>>
      >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
      >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
      >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"

      >>>>>>
      >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"

      >>>>>>>
      >>>>>>>
      >>>>>>>
      >>>>>>>
      >>>>>>>
      >>>>>>> Am I insane for suggesting this?  It would make things

a
little

      >>>>>>> easier

      >>>>>>
      >>>>>> for

      >>>>>>>
      >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived

     objects, blah,

      >>>>>>
      >>>>>> blah,

      >>>>>>>
      >>>>>>> blah...
      >>>>>>>
      >>>>>>>
      >>>>>>> Best,
      >>>>>>>
      >>>>>>> --t
      >>>>>>>
      >>>>>>>
      >>>>>>>
      >>>>>>>
      >>>>>>> Statistics is the grammar of science.
      >>>>>>> Karl Pearson

     <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

      >>>>>>
      >>>>>>
      >>>>>>         [[alternative HTML version deleted]]
      >>>>>>
      >>>>>> _______________________________________________
      >>>>>> Bioc-devel at r-project.org

<mailto:Bioc-devel at r-project.org>
     mailing list

      >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

      >>>
      >>>
      >>>     [[alternative HTML version deleted]]
      >>>
      >>> _______________________________________________
      >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>

     mailing list

      >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel

      >>
      >>
      >> _______________________________________________
      >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>

     mailing list

      >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
      >>

      >
      > --
      > Herv? Pag?s
      >
      > Program in Computational Biology
      > Division of Public Health Sciences
      > Fred Hutchinson Cancer Research Center
      > 1100 Fairview Ave. N, M1-B514
      > P.O. Box 19024
      > Seattle, WA 98109-1024
      >
      > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
      > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
      > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
      >
      >
      > _______________________________________________
      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>

     mailing list

      > https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Thread (22 messages)