[Bioc-devel] A geneSet data class for facilitating GSEA
On Wednesday 14 March 2007 12:59, Seth Falcon wrote:
Hi,
On Wed, 14 Mar 2007, Sean Davis wrote:
GSEA, both the specific method and the general concept, is becoming more prevalent and important in data analysis. There have been several mentions of including various "gene lists" for use with Category or other methods. Is there interest in making a generic geneSet class for storing such information? (Or does it already exist?)
I also think this is a good idea and is something we (BioC Seattle
group) are willing to help with.
It looks like the class defined in the soon-to-be-in-devel PGSEA
package is very close to what is wanted. Having had a brief look at
PGSEA it looks like a delimited format is defined for reading/writing
gene set objects.
Since the gene sets on the Broad's website__ already provide a simple
XML format, I think it would be nice to be able to read and write that
format. And we should make sure we have corresponding slots for the
fields they use:
Standard name # name of set
LSID # ID of set
Brief description
Collection # collection ID
Full description or Abstract
Publication URL
External links
Organism
Contributed by
Source platform
Genes
__ http://www.broad.mit.edu/gsea/msigdb/cards/chr16q24.html
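To make the slot list above concrete, here is a rough sketch of a record with one slot per Broad field. It is written in Python purely for illustration (an actual Bioconductor class would be an R S4 class), and the class name, the slot names, and the example gene IDs are all made up for this sketch rather than taken from PGSEA or the Broad format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneSet:
    """One gene set; each slot mirrors one Broad field (names are hypothetical)."""
    standard_name: str                    # Standard name: name of set
    lsid: str = ""                        # LSID: ID of set
    brief_description: str = ""           # Brief description
    collection: str = ""                  # Collection: collection ID
    full_description: str = ""            # Full description or Abstract
    publication_url: str = ""             # Publication URL
    external_links: List[str] = field(default_factory=list)  # External links
    organism: str = ""                    # Organism
    contributed_by: str = ""              # Contributed by
    source_platform: str = ""             # Source platform
    genes: List[str] = field(default_factory=list)           # Genes


# Placeholder values only; the gene IDs are not real identifiers.
example = GeneSet(
    standard_name="chr16q24",
    collection="cytogenetic bands",
    genes=["GENE_A", "GENE_B"],
)
print(example.standard_name, len(example.genes))
```

Keeping the collection as a plain ID slot, as here, is the flat option; it says nothing yet about how sets of sets would be stored.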
I think the collection ID makes a lot of sense since some gene sets
are really sets of gene sets like GO and cytogenetic bands.
One concern with this approach is that for sets of gene sets (again,
GO or cytogenetic bands) we will have a fair amount of duplication.
But I'm not sure it will be a problem.
I agree that these are all close. I was thinking of keeping the collections as a separate higher-level data structure. However, an email I got off-list suggested that a geneSet could be composed of a set of IDs OR another set of geneSets. A collection would then be a set of geneSets that are related in some way. The interpretation is straightforward--a geneSet becomes the union of all unique IDs in the contained geneSets. So a maintainer could choose to code chr16q as a combination of all the geneSets for the bands of 16q, or simply make one large vector of IDs. Either would work for downstream processing. What is more problematic in such a setup is an API for getting at individual geneSets embedded in a higher-level set (I want 16q24, but how do I get there if I need to go through chr16 and 16q24). I'm inclined to think that hierarchical geneSets might be too complicated to want to deal with, but Seth and the Bioc folks would know best.
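For what it's worth, the hierarchical idea can be sketched in a few lines. This is only an illustration in Python, not a proposed API: the class name NestedGeneSet, the methods all_ids and find, and the IDs are all hypothetical. A set holds its own IDs and/or child sets, its effective membership is the union of unique IDs below it, and lookup by name is a depth-first search so the caller does not need to know the path through chr16.

```python
from typing import Iterable, Optional, Set


class NestedGeneSet:
    """A gene set made of its own IDs and/or other gene sets (hypothetical)."""

    def __init__(self, name: str,
                 ids: Iterable[str] = (),
                 children: Iterable["NestedGeneSet"] = ()):
        self.name = name
        self.ids = set(ids)
        self.children = list(children)

    def all_ids(self) -> Set[str]:
        """Union of this set's IDs and the IDs of every contained set."""
        result = set(self.ids)
        for child in self.children:
            result |= child.all_ids()
        return result

    def find(self, name: str) -> Optional["NestedGeneSet"]:
        """Depth-first search for a contained set by name; None if absent."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# Placeholder IDs; ID2 is shared to show that the union removes duplicates.
band_23 = NestedGeneSet("16q23", ids=["ID1", "ID2"])
band_24 = NestedGeneSet("16q24", ids=["ID2", "ID3"])
arm = NestedGeneSet("chr16q", children=[band_23, band_24])
chrom = NestedGeneSet("chr16", children=[arm])

print(sorted(arm.all_ids()))
print(chrom.find("16q24").name)
```

The search assumes set names are unique within a hierarchy, which is exactly the kind of thing a class definition would have to pin down.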
I'm not sure yet whether ID-type specific subclasses will make things easier or not. I am certain that we will be able to add some smarts to how the annotation is dealt with to allow at least some basic translation between IDs such as Entrez and gene symbol.
I agree. The one point that Vince's email makes, though, is that it would be necessary to standardize the nomenclature for the various gene ID types if there is any hope of introducing "smarts" in dealing with translation. One way is to subclass, but the other is to validate any idType slot with agreed-upon types.
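As a rough illustration of the second option (validating an idType slot instead of subclassing), something like the following would do. It is a Python sketch under stated assumptions: the vocabulary in ALLOWED_ID_TYPES, the class TypedGeneSet, the translate helper, and the mapping table are all invented here, and the IDs are placeholders rather than real Entrez IDs or symbols.

```python
from typing import Dict, Iterable

# Hypothetical agreed-upon vocabulary; the real list would need to be decided.
ALLOWED_ID_TYPES = frozenset({"entrez", "symbol"})


class TypedGeneSet:
    """A gene set whose id_type must come from the agreed vocabulary."""

    def __init__(self, name: str, id_type: str, ids: Iterable[str]):
        if id_type not in ALLOWED_ID_TYPES:
            raise ValueError(
                f"unknown id_type {id_type!r}; expected one of "
                f"{sorted(ALLOWED_ID_TYPES)}"
            )
        self.name = name
        self.id_type = id_type
        self.ids = list(ids)


def translate(gene_set: TypedGeneSet, target_type: str,
              mapping: Dict[str, str]) -> TypedGeneSet:
    """Return a new set with IDs mapped to target_type; unmapped IDs are dropped."""
    new_ids = [mapping[i] for i in gene_set.ids if i in mapping]
    return TypedGeneSet(gene_set.name, target_type, new_ids)


# Placeholder mapping from made-up "entrez" IDs to made-up "symbol" IDs.
entrez_to_symbol = {"1001": "SYM_A", "1002": "SYM_B"}

by_entrez = TypedGeneSet("example", "entrez", ["1001", "1002", "1003"])
by_symbol = translate(by_entrez, "symbol", entrez_to_symbol)
print(by_symbol.id_type, by_symbol.ids)
```

Silently dropping unmapped IDs is just one choice; warning or keeping them flagged would be equally defensible.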
Perhaps we should start a wiki page to hammer out a class definition?
Sounds great.
Sean