[Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments
On Mon, Oct 13, 2014 at 9:44 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
Hi,
On 10/11/2014 02:25 PM, Vincent Carey wrote:
On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
But what it would do exactly?
Probably would want to be able to extract a gene list from a TxDb, then extract the desired type of structure from the TxDb. Not too bad right now, but it would be nice to leverage the identifier type information on the gene list object.
Currently:
    tx <- transcripts(txdb, vals=list(gene_id=genes))
Proposed:
    tx <- transcripts(txdb[GeneList])
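To make "leverage the identifier type information" concrete, here is a minimal base-R sketch, assuming typed ID vectors that extend character. The classes GeneId and TxId, the generic filterColumn, and the helper asVals are all hypothetical names invented for illustration; none of them is part of GenomicFeatures. The sketch only shows how the ID type could select the filter column that is currently spelled out by hand in the vals list.

```r
library(methods)

# Hypothetical typed-ID classes; each simply extends character.
setClass("GeneId", contains = "character")
setClass("TxId",   contains = "character")

# Hypothetical generic: map the ID type to the filter column it implies.
setGeneric("filterColumn", function(ids) standardGeneric("filterColumn"))
setMethod("filterColumn", "GeneId", function(ids) "gene_id")
setMethod("filterColumn", "TxId",   function(ids) "tx_id")

# Build the kind of named list that is currently written by hand,
# e.g. list(gene_id = genes).
asVals <- function(ids) {
    setNames(list(ids@.Data), filterColumn(ids))
}

genes <- new("GeneId", c("673", "672"))
vals <- asVals(genes)
str(vals)
```

With something like this, the type of the ID vector decides which filter is applied, so the caller no longer has to name the column.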
yes, that makes sense. i don't go to txdb's as naturally as i should.
Also coming a little late to the party, but I also have a preference for Kasper's proposal of using subsetByXXX. Supporting 'txdb[GeneList]' is arbitrarily making gene ids special, when a TxDb contains other ids (transcript and exon ids).
My proposal was in the context of having formal vectors of IDs, as Gabe has done (internally as of yet). Basically, extending a character vector to track the type of ID. GSEABase has something similar. I agree plain old character vectors make no sense here.
Also, if you push this concept a little, you quickly run into
some semantic headaches:
- First, let's keep in mind that for a common track like the
"UCSC Genes" track, a lot of transcripts are not linked to any
gene.
- Then, allowing subsetting a TxDb by a character vector means
a TxDb has names. At least conceptually. So it's tempting to
also support 'names(txdb)' (would return all the gene ids).
- Finally, the names being unique, it seems natural to expect that
'txdb[names(txdb)]' is a no-op. But it won't be, because
'txdb[names(txdb)]' will drop all the transcripts that are not
linked to a gene.
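The round-trip problem in the last bullet can be reproduced with a toy table. This is only a sketch: the data frame stands in for a TxDb, and toyNames and toySubset are hypothetical helpers, not real TxDb methods.

```r
# Toy stand-in for a TxDb: one row per transcript; NA means "no gene".
tx <- data.frame(
    tx_id   = c("tx1", "tx2", "tx3", "tx4"),
    gene_id = c("g1",  "g1",  NA,    "g2"),
    stringsAsFactors = FALSE
)

# Hypothetical "names": the unique gene ids present in the table.
toyNames <- function(tx) unique(tx$gene_id[!is.na(tx$gene_id)])

# Hypothetical subsetting by gene id: keep transcripts linked to those genes.
toySubset <- function(tx, ids) tx[tx$gene_id %in% ids, , drop = FALSE]

roundTrip <- toySubset(tx, toyNames(tx))
nrow(tx)         # 4
nrow(roundTrip)  # 3: tx3 has no gene and is dropped
```

So subsetting by all the names is not a no-op as soon as one transcript lacks a gene.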
But before any TxDb subsetting can happen (via [ or subsetByXXX), we
need to bring back the classic (and healthier) pass-by-value semantics
on these objects. (Right now TxDb is a reference class and thus TxDb
objects have pass-by-reference semantics.)
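For readers less familiar with the distinction, the following sketch contrasts the two semantics using only the methods package. RefBox and ValBox are hypothetical toy classes, not the TxDb implementation; the point is just that a function can silently change a reference object owned by its caller, which is what makes subsetting risky.

```r
library(methods)

# Reference class: objects are mutable and shared between caller and callee.
RefBox <- setRefClass("RefBox", fields = list(ids = "character"))

# Ordinary S4 class: the callee works on its own copy.
setClass("ValBox", representation(ids = "character"))

dropFirstRef <- function(box) {
    box$ids <- box$ids[-1]   # also changes the caller's object
    invisible(box)
}
dropFirstVal <- function(box) {
    box@ids <- box@ids[-1]   # changes the local copy only
    box
}

r <- RefBox$new(ids = c("a", "b", "c"))
v <- new("ValBox", ids = c("a", "b", "c"))

dropFirstRef(r)
invisible(dropFirstVal(v))

length(r$ids)  # 2: the caller's object was modified
length(v@ids)  # 3: the caller's object is untouched
```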
H.
On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 10/11/2014 08:41 AM, Vincent Carey wrote:
Is there anything on the order of as([GeneSet], "GRanges") around?
no, I don't think so; obviously of use and following a common theme.
Martin
On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.gabe at gene.com> wrote:
Sean and Vincent,
The goal of what we are doing builds off of what Martin has in GSEABase. We were looking to see how much benefit we can get with something lighter-weight that lies between indistinguishable character vectors and the full machinery of GeneSets. Either way, it seems like formalizing the semantic information is a way to do what you want. Furthermore, these classed id objects can be created automatically when there is contextual information, e.g. during queries to databases (or db-like objects), and then simply added to metadata DataFrames and re-used.
~G
On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.gabe at gene.com> wrote:
Hey all,
We are in the (very) early stages of experimenting with something that
seems relevant here: classed identifiers. We are using them for
database/mart queries, but the same concept could be useful for the
cases you're describing I think.
E.g.
    > mysyms = GeneSymbol(c("BRAF", "BRCA1"))
    > mysyms
    An object of class "GeneSymbol"
    [1] "BRAF"  "BRCA1"
    > yourSE[mysyms, ]
    ...
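A minimal sketch of how such a classed identifier could drive row selection, assuming a plain S4 class that extends character. The GeneSymbol definition below is a hypothetical re-creation, not the experimental code referred to in this message; resolveRows, the toy matrix, and the symbol lookup vector are likewise invented for illustration.

```r
library(methods)

# Hypothetical classed identifier: a character vector that knows its ID type.
setClass("GeneSymbol", contains = "character")
GeneSymbol <- function(x) new("GeneSymbol", as.character(x))

# Toy data: a matrix keyed by Entrez-style row names, plus a symbol lookup.
mat <- matrix(1:6, nrow = 3,
              dimnames = list(c("673", "672", "7157"), c("s1", "s2")))
symbols <- c("673" = "BRAF", "672" = "BRCA1", "7157" = "TP53")

# Hypothetical generic that turns identifiers into row indices.
setGeneric("resolveRows",
           function(ids, rowIds, symbolOf) standardGeneric("resolveRows"),
           signature = "ids")

# Plain character vectors keep their usual meaning: they are row names.
setMethod("resolveRows", "character",
          function(ids, rowIds, symbolOf) match(ids, rowIds))

# Only an explicitly declared GeneSymbol is translated through the lookup.
setMethod("resolveRows", "GeneSymbol",
          function(ids, rowIds, symbolOf) match(ids@.Data, symbolOf[rowIds]))

mysyms <- GeneSymbol(c("BRAF", "BRCA1"))
mat[resolveRows(mysyms, rownames(mat), symbols), ]
mat[resolveRows("7157", rownames(mat), symbols), , drop = FALSE]
```

Because dispatch depends on the declared class, an ordinary character vector is never reinterpreted as a set of gene symbols.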
This approach has the flavor of some of the functionality that Martin put together for the GSEABase package (EntrezIdentifier, etc.).
Sean
This approach has the benefit of being declarative instead of heuristic (people won't be able to accidentally invoke it), while still giving most of the convenience I believe you are looking for. The object classes inherit directly from character, so should "just work" most of the time, but as I said it's early days; lots more testing for functionality and usefulness is needed.
~G
On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
OK by me to leave [ alone. We could start with subsetByEntrez, subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID. Utilities to generate GRanges for queries in each of these vocabularies should, perhaps, be in the OrganismDb space? Once those are in place no additional infrastructure is necessary?
On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
Agreed with Sean, having tried implementing the "magical" alternative
--t
On Sep 20, 2014, at 9:31 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
Hi, Vince.
I'm coming a little late to the party, but I agree with Kasper's sentiment that the less "magical" approach of using subsetByXXX might be the cleaner way to go for the time being.
Sean
On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
vignettes/SEresolver.Rnw
shows some modifications to [ that allow subsetting of SE by gene or pathway name
it may be premature to work at the [ level. Kasper suggested defining a suite of subsetBy operations that would accomplish this
i think we could get something along these lines into the release without too much more work. votes?
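As a rough illustration of what a suite of subsetBy operations could look like, here is a base-R sketch. The list-based toy object stands in for a SummarizedExperiment, and subsetByColumn is a hypothetical shared worker; the names subsetByEntrez and subsetBySymbol come from this thread, but the bodies are assumptions, not the vignette's code.

```r
# Toy stand-in for a SummarizedExperiment: an assay matrix plus row annotation.
toy <- list(
    assay = matrix(1:8, nrow = 4,
                   dimnames = list(paste0("f", 1:4), c("s1", "s2"))),
    rowAnno = data.frame(
        entrez = c("673", "672", "7157", "1956"),
        symbol = c("BRAF", "BRCA1", "TP53", "EGFR"),
        stringsAsFactors = FALSE
    )
)

# Hypothetical shared worker: keep the rows whose annotation matches 'ids'.
subsetByColumn <- function(x, column, ids) {
    keep <- x$rowAnno[[column]] %in% ids
    list(assay   = x$assay[keep, , drop = FALSE],
         rowAnno = x$rowAnno[keep, , drop = FALSE])
}

# One explicit, named entry point per vocabulary; nothing is guessed.
subsetByEntrez <- function(x, ids) subsetByColumn(x, "entrez", ids)
subsetBySymbol <- function(x, ids) subsetByColumn(x, "symbol", ids)

bySym <- subsetBySymbol(toy, c("BRAF", "EGFR"))
rownames(bySym$assay)  # "f1" "f4"
```

The caller states the vocabulary in the function name, so [ keeps its usual meaning.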
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biologist
Genentech Research
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319