From: Vincent Carey 525-2265 <stvjc at channing.harvard.edu>
Date: Wed, 14 Mar 2007 10:19:36 -0400 (EDT)
To: Sean Davis <sdavis2 at mail.nih.gov>
Cc: <bioc-devel at stat.math.ethz.ch>, Ross Lazarus <rerla at channing.harvard.edu>
Subject: Re: [Bioc-devel] A geneSet data class for facilitating GSEA
i like this idea in principle. the RGenetics folks may have done
something in this direction.
you might want to have geneList as an abstract class, and then
extend to EntrezGeneList, RefseqGeneList and so forth so that
dispatch could work without looking into the idType ...
a version or date field might also be important
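to make the dispatch idea concrete, here is a minimal sketch in Python rather than R, just to show the shape. all names in it (GeneList, EntrezGeneList, RefseqGeneList, describe, the version field) are hypothetical and not an existing API; the point is only that methods are selected by the subclass, so nothing has to inspect an idType string.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Tuple


@dataclass(frozen=True)
class GeneList:
    """Abstract parent: a tuple of IDs plus a version/date stamp (hypothetical fields)."""
    ids: Tuple[str, ...]
    version: str = ""  # e.g. a release tag or date for the annotation source


class EntrezGeneList(GeneList):
    """IDs are interpreted as Entrez Gene identifiers."""


class RefseqGeneList(GeneList):
    """IDs are interpreted as RefSeq accessions."""


@singledispatch
def describe(gene_list) -> str:
    # Fallback: reached for the parent class or any subclass without its own method.
    raise NotImplementedError(f"no method for {type(gene_list).__name__}")


@describe.register
def _(gene_list: EntrezGeneList) -> str:
    # Chosen purely by class; no idType field is consulted.
    return f"{len(gene_list.ids)} Entrez Gene IDs (version {gene_list.version})"


@describe.register
def _(gene_list: RefseqGeneList) -> str:
    return f"{len(gene_list.ids)} RefSeq accessions (version {gene_list.version})"


# Illustrative IDs only.
entrez = EntrezGeneList(ids=("1017", "7157"), version="2007-03-14")
refseq = RefseqGeneList(ids=("NM_000546",), version="2007-03-14")
```

calling describe(entrez) and describe(refseq) picks a different method for each without any branching on an ID-type string.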
---
Vince Carey, PhD
Assoc. Prof Med (Biostatistics)
Harvard Medical School
Channing Laboratory - ph 6175252265 fa 6177311541
181 Longwood Ave Boston MA 02115 USA
stvjc at channing.harvard.edu
On Wed, 14 Mar 2007, Sean Davis wrote:
GSEA, both the specific method and the general concept, is becoming more
prevalent and important in data analysis. There have been several mentions
of including various "gene lists" for use with Category or other methods. Is
there interest in making a generic geneSet class for storing such
information? (Or does it already exist and I just haven't seen it?) I bring
this up because I think it could be quite useful to have a general solution
for the community (like the eSet class has become). A class could range from
something as simple as a vector of Entrez Gene IDs to something more
complicated (but perhaps a bit more useful for general consumption) like:
identifier: an identifier for the set (perhaps from a public database like
    MSigDB)
title: One line title
description: free text description
species: The species to which the dataset applies
URL: from where the data were derived
MIAME: class "MIAME" object
protocol: (could be in MIAME, also) description of methods to produce
    genelist from raw data source
idType: What type of ID is stored (Entrez, Refseq, Ensembl, etc)?
geneList: vector of IDs
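As a rough illustration of the field list above, here is a minimal sketch written in Python for brevity rather than as an R class. The class name GeneSet, the snake_case field names, and the choice to drop duplicate IDs are assumptions made for the example; the MIAME slot is left as an untyped placeholder, since its structure is not specified here.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class GeneSet:
    """One gene set plus the metadata fields listed above (hypothetical layout)."""
    identifier: str               # e.g. an ID from a public database
    title: str                    # one-line title
    gene_list: List[str]          # vector of IDs
    id_type: str                  # what kind of ID is stored (Entrez, Refseq, ...)
    description: str = ""         # free-text description
    species: str = ""             # species to which the set applies
    url: str = ""                 # where the data were derived from
    miame: Optional[Any] = None   # placeholder for a MIAME-style object
    protocol: str = ""            # how the list was produced from raw data

    def __post_init__(self) -> None:
        # Assumption: a "set" should not repeat IDs, so drop duplicates
        # while keeping first-seen order.
        self.gene_list = list(dict.fromkeys(self.gene_list))


# Illustrative values only.
example = GeneSet(
    identifier="EXAMPLE_SET_1",
    title="Illustrative gene set",
    gene_list=["1017", "7157", "1017"],
    id_type="Entrez",
    species="Homo sapiens",
)
```

Only the identifier, title, ID vector, and ID type are required in this sketch; the remaining metadata fields default to empty so that a set can be created before all of its annotation is known.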
A simple wrapper data structure (even just a list) could then be used to
distribute the geneSets. Some methods could then be defined for converting
to an incidence matrix for use by Category, etc. But I think the most
important points from above are 1) maintaining some metadata about the
genelists and 2) standardization to reduce duplicated work. Individual
groups would then instantiate the geneSets using whatever means they see fit
(parsing MSigDB, IPI files, etc.).
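The incidence-matrix conversion mentioned above might look something like the following sketch (in Python, for illustration). The function name incidence_matrix and the orientation (one row per gene set, one column per gene) are assumptions, not what Category or any existing package requires.

```python
from typing import Dict, List, Sequence, Tuple


def incidence_matrix(
    gene_sets: Dict[str, Sequence[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (set names, gene IDs, 0/1 matrix) with rows = sets, columns = genes."""
    set_names = list(gene_sets)
    # Columns: the union of all IDs, sorted as strings for a stable order.
    genes = sorted({gene for ids in gene_sets.values() for gene in ids})
    column = {gene: j for j, gene in enumerate(genes)}

    matrix = [[0] * len(genes) for _ in set_names]
    for i, name in enumerate(set_names):
        for gene in gene_sets[name]:
            matrix[i][column[gene]] = 1  # mark membership of gene in set i
    return set_names, genes, matrix


# Illustrative input: set name -> vector of IDs.
sets = {"setA": ["1017", "7157"], "setB": ["7157", "672"]}
rows, cols, m = incidence_matrix(sets)
```

Each entry of m is 1 when the gene in that column belongs to the set in that row, and 0 otherwise.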
Any thoughts?
Sean