[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
Hi Gabe, I wouldn't say "wrong" but "not normalized". So maybe for example everything is ok with their seqinfo but the seqlevels are not in canonical order, and, for whatever reason, they want to keep them that way (the actual order of the seqlevels impacts the sorting of the ranges in the GRanges object). If it turns out that the genome column is empty and they want to set it, then they would be left with no way to do so. Another concern I have is that genome<- has been the low-level setter for the genome column for a while and changing its behavior now might break some existing code. It's fixable, but we can avoid that. H.
On 06/05/2015 11:51 AM, Gabe Becker wrote:
Herve,
This is probably a naive question, but what usecases are there for
creating an object with the wrong seqinfo for its genome?
~G
On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
On Thu, Jun 4, 2015 at 11:48 PM, Herv? Pag?s <hpages at fredhutch.org
<mailto:hpages at fredhutch.org>> wrote:
> I also think that we're heading towards something like that.
>
> So genome(gr) <- "hg19" would:
>
> (a) Add any missing information to the seqinfo.
> (b) Sort the seqlevels in "canonical" order.
> (c) Change the seqlevels style to UCSC style if they are not.
>
> The 3 tasks are orthogonal. I guess most of the times people want
> an easy way to perform them all at once.
>
> We could easily support (a) and (b). This assumes that the current
> seqlevels are already valid hg19 seqlevels:
>
> si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
"chr6_cox_hap2"))
> gr1 <- GRanges(seqinfo=si1)
> hg19_si <- Seqinfo(genome="hg19")
>
> ## (a):
> seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> seqinfo(gr1)
> # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> # seqnames seqlengths isCircular genome
> # chrX 155270560 FALSE hg19
> # chrUn_gl000249 38502 FALSE hg19
> # chr2 243199373 FALSE hg19
> # chr6_cox_hap2 4795371 FALSE hg19
>
> ## (b):
> seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> seqinfo(gr1)
> # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> # seqnames seqlengths isCircular genome
> # chr2 243199373 FALSE hg19
> # chrX 155270560 FALSE hg19
> # chr6_cox_hap2 4795371 FALSE hg19
> # chrUn_gl000249 38502 FALSE hg19
>
> (c) is harder because seqlevelsStyle() doesn't know how to rename
> scaffolds yet:
>
> si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
"HSCHR6_MHC_COX_CTG1"))
> gr2 <- GRanges(seqinfo=si2)
>
> seqlevelsStyle(gr2)
> # [1] "NCBI"
>
> seqlevelsStyle(gr2) <- "UCSC"
> seqlevels(gr2)
> # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> # [4] "HSCHR6_MHC_COX_CTG1"
>
> So we need to work on this.
>
> I'm not sure about using genome(gr) <- "hg19" for this. Right now
> it sets the genome column of the seqinfo with the supplied string
> and nothing else. Aren't there valid use cases for this?
Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.
> What about
> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> whole seqinfo component actually gets filled.
>
Yea, but "genome" is so intuitive compared to "seqinfo".
> H.
>
> On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>>
>> that's kind of always been my goal...
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>
<mailto:lawrence.michael at gene.com
<mailto:lawrence.michael at gene.com>>> wrote:
>>
>> Maybe this could eventually support setting the seqinfo with:
>>
>> genome(gr) <- "hg19"
>>
>> Or is that being too clever?
>>
>> On Thu, Jun 4, 2015 at 4:28 PM, Herv? Pag?s
<hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>> <mailto:hpages at fredhutch.org <mailto:hpages at fredhutch.org>>>
wrote:
>> > Hi,
>> >
>> > FWIW I started to work on supporting quick generation of a
>> standalone
>> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>> >
>> > It already supports hg38, hg19, hg18, panTro4, panTro3,
panTro2,
>> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1,
musFur1,
>> mm10,
>> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>> galGal3,
>> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
sacCer3,
>> > and sacCer2. I'll add more.
>> >
>> > See ?Seqinfo for some examples.
>> >
>> > Right now it fetches the information from internet every
time you
>> > call it but maybe we should just store that information
in the
>> > GenomeInfoDb package as Tim suggested?
>> >
>> > H.
>> >
>> >
>> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>> >>
>> >> That would be perfect actually. And it would radically
reduce &
>> >> modularize maintenance. Maybe that's the best way to go
after
>> all. Quite
>> >> sensible.
>> >>
>> >> --t
>> >>
>> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>> <stvjc at channing.harvard.edu
<mailto:stvjc at channing.harvard.edu>
<mailto:stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>>
>> >>> wrote:
>> >>>
>> >>> It really isn't hard to have multiple OrganismDb
packages in
>> place -- the
>> >>> process of making new ones is documented and was given
as an
>> exercise in
>> >>> the EdX course. I don't know if we want to
institutionalize it
>> and
>> >>> distribute such -- I think we might, so that there would be
>> Hs19, Hs38,
>> >>> mm9, etc. packages. They have very little content,
they just
>> coordinate
>> >>> interactions with packages that you'll already have.
>> >>>
>> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>
<mailto:tim.triche at gmail.com <mailto:tim.triche at gmail.com>>>
>>
>> >>> wrote:
>> >>>
>> >>>> Right, I typically do that too, and if you're working
on human
>> data it
>> >>>> isn't a big deal. What makes things a lot more of a
drag is
>> when you
>> >>>> work
>> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38)
where
>> >>>> Mus.musculus
>> >>>> is essentially a "build ahead" of Homo.sapiens.
>> >>>>
>> >>>> R> seqinfo(Homo.sapiens)
>> >>>> Seqinfo object with 93 sequences (1 circular) from
hg19 genome
>> >>>>
>> >>>> R> seqinfo(Mus.musculus)
>> >>>> Seqinfo object with 66 sequences (1 circular) from
mm10 genome:
>> >>>>
>> >>>> It's not as explicit as directly assigning the seqinfo
from a
>> genome
>> >>>> that
>> >>>> corresponds to that of your
annotations/results/whatever. I
>> know we
>> >>>> could
>> >>>> all use crossmap or liftOver or whatever, but that's not
>> really the
>> >>>> same,
>> >>>> and it takes time, whereas assigning the proper
seqinfo for
>> >>>> relationships
>> >>>> is very fast.
>> >>>>
>> >>>> That's all I was getting at...
>> >>>>
>> >>>>
>> >>>> Statistics is the grammar of science.
>> >>>> Karl Pearson
>> >>>>
>> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>> >>>> <stvjc at channing.harvard.edu
<mailto:stvjc at channing.harvard.edu>
<mailto:stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
>> >>>>>
>> >>>>> wrote:
>> >>>>
>> >>>>
>> >>>>> I typically get this info from Homo.sapiens. The
result is
>> parasitic
>> >>>>> on
>> >>>>> the TxDb that is in there. I don't know how easy it
is to swap
>> >>>>> alternate
>> >>>>> TxDb in to get a different build. I think it would
make sense
>> to
>> >>>>> regard
>> >>>>> the OrganismDb instances as foundational for this sort of
>> structural
>> >>>>> data.
>> >>>>>
>> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>> >>>>> kasperdanielhansen at gmail.com
<mailto:kasperdanielhansen at gmail.com>
>> <mailto:kasperdanielhansen at gmail.com
<mailto:kasperdanielhansen at gmail.com>>> wrote:
>> >>>>>
>> >>>>>> Let me rephrase this slightly. From one POV the
purpose of
>> >>>>>> GenomeInfoDb
>> >>>>>> is
>> >>>>>> clean up the seqinfo slot. Currently it does most
of the
>> cleaning,
>> >>>>>> but
>> >>>>>> it
>> >>>>>> does not add seqlengths.
>> >>>>>>
>> >>>>>> It is clear that seqlengths depends on the version
of the
>> genome, but
>> >>>>>> I
>> >>>>>> will argue so does the seqnames. Of course, for human,
>> chr22 will not
>> >>>>>> change. But what about the names of all the random
>> contigs? Or for
>> >>>>>> other
>> >>>>>> organisms, what about going from a draft genome with 10k
>> contigs to a
>> >>>>>> more
>> >>>>>> completely genome assembled into fewer, larger
chromosomes.
>> >>>>>>
>> >>>>>> I acknowledge that this information is present in
the BSgenome
>> >>>>>> packages,
>> >>>>>> but it seems (to me) to be very appropriate to have them
>> around for
>> >>>>>> cleaning up the seqinfo slot. For some situations
it is not
>> great to
>> >>>>>> depend on 1 GB> download for something that is a few
bytes.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Kasper
>> >>>>>>
>> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>
<mailto:tim.triche at gmail.com <mailto:tim.triche at gmail.com>>>
>>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> It would be nice (for a number of reasons) to have
>> chromosome lengths
>> >>>>>>> readily available in a foundational package like
>> GenomeInfoDb, so
>> >>>>>>> that,
>> >>>>>>> say,
>> >>>>>>>
>> >>>>>>> data(seqinfo.hg19)
>> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[
seqlevels(myResults) ]
>> >>>>>>>
>> >>>>>>> would work without issues. Is there any particular
reason
>> this
>> >>>>>>
>> >>>>>> couldn't
>> >>>>>>>
>> >>>>>>> happen for the supported/available BSgenomes? It would
>> seem like a
>> >>>>>>
>> >>>>>> simple
>> >>>>>>>
>> >>>>>>> matter to do
>> >>>>>>>
>> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>> >>>>>>> R> save(seqinfo.hg19,
>> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>> >>>>>>>
>> >>>>>>> and be done with it until (say) the next release or
next
>> released
>> >>>>>>> BSgenome. I considered looping through the following
>> BSgenomes
>> >>>>>>
>> >>>>>> myself...
>> >>>>>>>
>> >>>>>>> and if it isn't strongly opposed by (everyone) I
may still
>> do exactly
>> >>>>>>> that. Seems useful, no?
>> >>>>>>>
>> >>>>>>> e.g. for the following 42 builds,
>> >>>>>>>
>> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
>> available.genomes())),
>> >>>>>>> value=TRUE)
>> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>> >>>>>>
>> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>> >>>>>>
>> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>> >>>>>>> "BSgenome.Celegans.UCSC.ce10"
>> >>>>>>>
>> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
>> "BSgenome.Celegans.UCSC.ce6"
>> >>>>>>>
>> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>> >>>>>>
>> >>>>>> "BSgenome.Drerio.UCSC.danRer5"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>> >>>>>>
>> >>>>>> "BSgenome.Drerio.UCSC.danRer7"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>> >>>>>>
>> >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>> >>>>>>>
>> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>> >>>>>>>
>> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>> >>>>>>
>> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>> >>>>>>
>> >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>> >>>>>>>
>> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>> >>>>>>
>> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>> >>>>>>
>> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Am I insane for suggesting this? It would make
things a
>> little
>> >>>>>>> easier
>> >>>>>>
>> >>>>>> for
>> >>>>>>>
>> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
>> objects, blah,
>> >>>>>>
>> >>>>>> blah,
>> >>>>>>>
>> >>>>>>> blah...
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> --t
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Statistics is the grammar of science.
>> >>>>>>> Karl Pearson
>> >>>>>>
>> >>>>>>
>> >>>>>> [[alternative HTML version deleted]]
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org>>
>> mailing list
>> >>>
>> >>>
>> >>> [[alternative HTML version deleted]]
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org>>
>> mailing list
>> >>
>> >>
>> >> _______________________________________________
>> >> Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org>>
>> mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >>
>> >
>> > --
>> > Herv? Pag?s
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fredhutch.org
<mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org
<mailto:hpages at fredhutch.org>>
>> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
<tel:%28206%29%20667-5791>
>> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
<tel:%28206%29%20667-1319>
>> >
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org>>
>> mailing list
>>
>>
>
> --
> Herv? Pag?s
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
_______________________________________________
Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319