[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Maybe this could eventually support setting the seqinfo with:

genome(gr) <- "hg19"

Or is that being too clever?

Hi,

FWIW I started to work on supporting quick generation of a standalone
Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.

It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
and sacCer2. I'll add more.

See ?Seqinfo for some examples.

Right now it fetches the information from the internet every time you
call it, but maybe we should just store that information in the
GenomeInfoDb package as Tim suggested?

H.
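
A minimal sketch of the "fetch once, then reuse" idea raised above, assuming a session-level cache is acceptable. The names `cached_seqinfo` and `.seqinfo_cache` are hypothetical and not part of GenomeInfoDb. The default `fetch` simply wraps the `Seqinfo(genome=...)` call described in this message, and is only evaluated when no other fetcher is supplied.

```r
# Hypothetical session-level cache; NOT part of GenomeInfoDb.
.seqinfo_cache <- new.env(parent = emptyenv())

# Return the cached object for `genome`, calling `fetch` only on a cache miss.
# The default `fetch` assumes GenomeInfoDb is installed and needs internet access.
cached_seqinfo <- function(genome,
                           fetch = function(g) GenomeInfoDb::Seqinfo(genome = g)) {
    if (!exists(genome, envir = .seqinfo_cache, inherits = FALSE)) {
        assign(genome, fetch(genome), envir = .seqinfo_cache)
    }
    get(genome, envir = .seqinfo_cache, inherits = FALSE)
}
```

With this in place, repeated calls such as `cached_seqinfo("hg38")` would hit the network only the first time in a session. Shipping the tables inside the package, as suggested in the thread, would remove the network dependency entirely.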

On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
That would be perfect actually.  And it would radically reduce &
modularize maintenance.  Maybe that's the best way to go after all.  Quite
sensible.

--t

On Jun 3, 2015, at 12:46 PM, Vincent Carey <stvjc at channing.harvard.edu>
wrote:

It really isn't hard to have multiple OrganismDb packages in place -- the
process of making new ones is documented and was given as an exercise in
the EdX course.  I don't know if we want to institutionalize it and
distribute such -- I think we might, so that there would be Hs19, Hs38,
mm9, etc. packages.  They have very little content, they just coordinate
interactions with packages that you'll already have.

On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

Right, I typically do that too, and if you're working on human data it
isn't a big deal.  What makes things a lot more of a drag is when you
work on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
Mus.musculus is essentially a "build ahead" of Homo.sapiens.

R> seqinfo(Homo.sapiens)
Seqinfo object with 93 sequences (1 circular) from hg19 genome

R> seqinfo(Mus.musculus)
Seqinfo object with 66 sequences (1 circular) from mm10 genome:

It's not as explicit as directly assigning the seqinfo from a genome
that corresponds to that of your annotations/results/whatever.  I know
we could all use crossmap or liftOver or whatever, but that's not really
the same, and it takes time, whereas assigning the proper seqinfo for
relationships is very fast.

That's all I was getting at...

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:

I typically get this info from Homo.sapiens.  The result is parasitic
on the TxDb that is in there.  I don't know how easy it is to swap
alternate TxDb in to get a different build.  I think it would make sense
to regard the OrganismDb instances as foundational for this sort of
structural data.

On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

Let me rephrase this slightly.  From one POV the purpose of
GenomeInfoDb is to clean up the seqinfo slot.  Currently it does most of
the cleaning, but it does not add seqlengths.

It is clear that seqlengths depend on the version of the genome, but I
will argue so do the seqnames.  Of course, for human, chr22 will not
change.  But what about the names of all the random contigs?  Or for
other organisms, what about going from a draft genome with 10k contigs
to a more complete genome assembled into fewer, larger chromosomes?

I acknowledge that this information is present in the BSgenome
packages, but it seems (to me) to be very appropriate to have them
around for cleaning up the seqinfo slot.  For some situations it is not
great to depend on a >1 GB download for something that is a few bytes.

Best,
Kasper

On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

It would be nice (for a number of reasons) to have chromosome lengths
readily available in a foundational package like GenomeInfoDb, so
that, say,

data(seqinfo.hg19)
seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]

would work without issues.  Is there any particular reason this
couldn't happen for the supported/available BSgenomes?  It would seem
like a simple matter to do

R> library(BSgenome.Hsapiens.UCSC.hg19)
R> seqinfo.hg19 <- seqinfo(Hsapiens)
R> save(seqinfo.hg19, file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")

and be done with it until (say) the next release or next released
BSgenome.  I considered looping through the following BSgenomes
myself... and if it isn't strongly opposed by (everyone) I may still do
exactly that.  Seems useful, no?

e.g. for the following 42 builds,

grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())), value=TRUE)
 [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
 [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
 [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
 [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
 [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
[11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
[13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
[15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
[17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
[19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
[21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
[23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
[25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
[27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
[29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
[31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
[33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
[35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
[37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
[39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
[41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"

Am I insane for suggesting this?  It would make things a little easier
for rtracklayer, most SummarizedExperiment and SE-derived objects, blah,
blah, blah...
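
The subset-and-assign step proposed in this message can be sketched offline as follows. This is only an illustration under stated assumptions: `seqinfo.hg19` here is a hand-built stand-in covering just three sequences (using the hg19 lengths for chr1, chr2 and chrM), not a dataset shipped with GenomeInfoDb, and `myResults` is a toy GRanges.

```r
library(GenomicRanges)  # provides GRanges plus the Seqinfo class and accessors

# Stand-in for a stored seqinfo.hg19 object (an assumption, built by hand).
seqinfo.hg19 <- Seqinfo(seqnames   = c("chr1", "chr2", "chrM"),
                        seqlengths = c(249250621L, 243199373L, 16571L),
                        isCircular = c(FALSE, FALSE, TRUE),
                        genome     = "hg19")

# Toy results that only use two of the sequences and carry no lengths yet.
myResults <- GRanges(c("chr1", "chr2"),
                     IRanges(start = c(1000, 5000), width = 100))

# Keep only the seqlevels present in the results, then assign.
seqinfo(myResults) <- seqinfo.hg19[seqlevels(myResults)]
seqlengths(myResults)
```

Subsetting by `seqlevels(myResults)` before assigning keeps the object's existing seqlevels and their order, so only the lengths, circularity flags and genome tag are filled in.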

Best,

--t

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
