[Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments

Hi,

On 10/11/2014 02:25 PM, Vincent Carey wrote:

On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

But what would it do exactly?
Probably would want to be able to extract a gene list from a TxDb, then
extract the desired type of structure from the TxDb.

Not too bad right now, but it would be nice to leverage the identifier
type information on the gene list object.

Currently:
tx <- transcripts(txdb, vals=list(gene_id=genes))

Proposed:
tx <- transcripts(txdb[GeneList])

yes, that makes sense.  i don't go to txdb's as naturally as i should.

Also coming a little late to the party, but I also have a preference
for Kasper's proposal of using subsetByXXX.

Supporting 'txdb[GeneList]' is arbitrarily making gene ids special,
when a TxDb contains other ids (transcript and exon ids).

My proposal was in the context of having formal vectors of IDs, as Gabe has
done (internally as of yet). Basically, extending a character vector to
track the type of ID. GSEABase has something similar. I agree plain old
character vectors make no sense here.
Also, if you push this concept a little, you quickly run into
some semantic headaches:

  - First, let's keep in mind that for a common track like the
    "UCSC Genes" track, a lot of transcripts are not linked to any
    gene.

  - Then, allowing subsetting a TxDb by a character vector means
    a TxDb has names. At least conceptually. So it's tempting to
    also support 'names(txdb)' (would return all the gene ids).

  - Finally, the names being unique, it seems natural to expect that
    'txdb[names(txdb)]' is a no-op. But it won't be, because
    'txdb[names(txdb)]' will drop all the transcripts that are not
    linked to a gene.
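
The last point can be illustrated with a toy table. This is only a sketch: the data frame stands in for a TxDb, and `toy_names()` and `toy_subset()` are hypothetical helpers, not part of any package.

```r
# Toy stand-in for the transcripts in a TxDb (not a real TxDb object).
tx <- data.frame(
  tx_name = c("tx1", "tx2", "tx3", "tx4"),
  gene_id = c("g1", "g1", NA, "g2"),   # tx3 is not linked to any gene
  stringsAsFactors = FALSE
)

# Hypothetical "names": the unique gene ids present in the table.
toy_names <- function(tx) unique(tx$gene_id[!is.na(tx$gene_id)])

# Hypothetical "[": keep only transcripts linked to the requested genes.
toy_subset <- function(tx, genes) tx[tx$gene_id %in% genes, , drop = FALSE]

round_trip <- toy_subset(tx, toy_names(tx))
nrow(tx)          # 4
nrow(round_trip)  # 3: tx3 has no gene id, so the round trip is not a no-op
```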

But before any TxDb subsetting can happen (via [ or subsetByXXX), we
need to bring back the classic (and healthier) pass-by-value semantics
on these objects. (Right now TxDb is a reference class and thus TxDb
objects have pass-by-reference semantics.)
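
A minimal sketch of why this matters for subsetting, using only the methods package. `RefBox` and `ValBox` are hypothetical toy classes, not the TxDb implementation.

```r
library(methods)

# Reference class: assignment does not copy, so both names share state.
RefBox <- setRefClass("RefBox", fields = list(ids = "character"))
a <- RefBox$new(ids = c("g1", "g2", "g3"))
b <- a
b$ids <- b$ids[1:2]   # "subset" b
length(a$ids)         # 2: a was modified as well

# Ordinary S4 class: assignment behaves like a copy (pass-by-value).
ValBox <- setClass("ValBox", representation(ids = "character"))
v <- ValBox(ids = c("g1", "g2", "g3"))
w <- v
w@ids <- w@ids[1:2]   # subset w
length(v@ids)         # 3: v is untouched
```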

H.

On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

 On 10/11/2014 08:41 AM, Vincent Carey wrote:
 Is there anything on the order of as([GeneSet], "GRanges") around?

no, I don't think so; obviously of use and following a common theme.
Martin

On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.gabe at gene.com> wrote:

  Sean and Vincent,

The goal of what we are doing builds off of what Martin has in GSEABase.
We were looking to see how much benefit we can get with something
lighter-weight that lies between indistinguishable character vectors and
the full machinery of GeneSets.

Either way, it seems like formalizing the semantic information is a way to
do what you want. Furthermore, these classed id objects can be created
automatically when there is contextual information, e.g. during queries to
databases (or db-like objects), and then simply added to metadata
DataFrames and re-used.

~G

On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.gabe at gene.com> wrote:

  Hey all,

We are in the (very) early stages of experimenting with something that
seems relevant here: classed identifiers. We are using them for
database/mart queries, but the same concept could be useful for the cases
you're describing, I think.

E.g.

  > mysyms = GeneSymbol(c("BRAF", "BRCA1"))
  > mysyms
  An object of class "GeneSymbol"
  [1] "BRAF"  "BRCA1"
  > yourSE[mysyms, ]
  ...

This approach has the flavor of some of the functionality that Martin put
together for the GSEABase package (EntrezIdentifier, etc.).

Sean

This approach has the benefit of being declarative instead of heuristic
(people won't be able to accidentally invoke it), while still giving most
of the convenience I believe you are looking for.

The object classes inherit directly from character, so should "just work"
most of the time, but as I said it's early days; lots more testing for
functionality and usefulness is needed.
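
A minimal sketch of such a classed identifier, using only the methods package. The actual classes were internal at the time of this thread, so the class definition, the `GeneSymbol()` constructor and the `idType()` generic here are all assumptions made for illustration.

```r
library(methods)

# Hypothetical class: a character vector that also records its identifier type.
setClass("GeneSymbol", contains = "character")

# Hypothetical constructor mirroring the GeneSymbol() call above.
GeneSymbol <- function(x) new("GeneSymbol", as.character(x))

mysyms <- GeneSymbol(c("BRAF", "BRCA1"))

is(mysyms, "character")   # TRUE: still a character vector
"BRAF" %in% mysyms        # TRUE: ordinary character operations still work

# Hypothetical generic: methods can branch on the declared identifier type.
setGeneric("idType", function(x) standardGeneric("idType"))
setMethod("idType", "character", function(x) "unknown")
setMethod("idType", "GeneSymbol", function(x) "gene symbol")

idType("BRAF")   # "unknown": a plain character vector carries no type
idType(mysyms)   # "gene symbol"
```

Because dispatch only fires for objects explicitly created as GeneSymbol, a plain character vector can never trigger the symbol-aware behaviour by accident.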

~G

On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:

OK by me to leave [ alone.  We could start with subsetByEntrez,
subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID.

Utilities to generate GRanges for queries in each of these vocabularies
should, perhaps, be in the OrganismDb space?  Once those are in place
no additional infrastructure is necessary?
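
A rough sketch of what two of these functions might look like. Everything here is an assumption for illustration: a plain matrix stands in for a SummarizedExperiment, `sym2entrez` is a hard-coded toy lookup rather than an OrganismDb query, and the function bodies are not taken from any package.

```r
# Toy assay matrix with Entrez gene ids as row names.
assay <- matrix(1:12, nrow = 4,
                dimnames = list(c("673", "672", "7157", "1956"),
                                c("s1", "s2", "s3")))

# Toy symbol-to-Entrez lookup; a real version would query an annotation resource.
sym2entrez <- c(BRAF = "673", BRCA1 = "672", TP53 = "7157", EGFR = "1956")

# Keep the rows whose Entrez id is in 'ids'.
subsetByEntrez <- function(x, ids) {
  x[rownames(x) %in% ids, , drop = FALSE]
}

# Translate symbols to Entrez ids, drop unknown symbols, then subset.
subsetBySymbol <- function(x, symbols) {
  ids <- sym2entrez[symbols]
  subsetByEntrez(x, ids[!is.na(ids)])
}

res <- subsetBySymbol(assay, c("BRAF", "BRCA1"))
rownames(res)   # "673" "672"
```

Each vocabulary gets its own explicitly named function, so the caller states which kind of identifier is being passed instead of relying on [ to guess.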

On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

Agreed with Sean, having tried implementing the "magical" alternative.

--t

On Sep 20, 2014, at 9:31 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:

Hi, Vince.
I'm coming a little late to the party, but I agree with Kasper's sentiment
that the less "magical" approach of using subsetByXXX might be the cleaner
way to go for the time being.
Sean

On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:

https://github.com/vjcitn/biocMultiAssay/blob/master/vignettes/SEresolver.Rnw

shows some modifications to [ that allow subsetting of SE by
gene or pathway name

it may be premature to work at the [ level.  Kasper suggested defining
a suite of subsetBy operations that would accomplish this

i think we could get something along these lines into the release without
too much more work.  votes?

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Computational Biologist
Genentech Research

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319