[Bioc-devel] oligo package, SNP+expr data - Bioc-devel

Tue, Feb 6, 2007 9:57 AM

Hello ...

I'll put a disclaimer up front that I'm extremely new to the world of SNP
data, so some of this might be overly naive.  I've been looking at the
oligo package with an eye toward using it to represent datasets involving
SNP and expression data over the same samples.

As background:  We have an application whose main purpose is to store
multiple gene expression datasets in a database and to provide a web front
end that lets users select probes/samples across multiple datasets fairly
quickly using generic queries ("all probes involved w/ the apoptosis
pathway", "any sample that is ER+", etc).  The database is populated using
ExpressionSet objects, which were also the model used for designing the DB
tables.

What we'd like to do now is to allow SNP data to reside in this database as
well, and then provide ways to interact with it.  The dataset that started
this ball rolling has both SNP and expression data for the same samples,
and the investigators would like to be able to tie this information
together - so beyond just having SNP and expression data supported, we'd
also like to provide mechanisms for linking these.  

On the SNP side of the data, at least at the moment, we'd like to be able
to represent Affy call information as well as copy number or an intensity
value.  

To get the ball rolling, I took a look at the oligo package to get a sense
for what containers it currently had, and how they worked.  There were
four in particular that caught my eye:

- The SnpCallSet:  Looks to essentially be an eSet object, but 
replacing the expression matrix with a matrix of the calls

- The SnpCopyNumberSet: Same, but with copy #

- oligoSnpSet: A container which would hold both calls & copy # (correct?)

- SnpQSet: This I'm not sure what it represents, but is the output of the
snprma() functionality

For starters, I was looking for confirmation that the above information is
actually correct (or not) :)  After that, I'm looking to start moving
towards some form of unified container -> it looks like this oligoSnpSet
would hold the information we desire on the SNP side, and then perhaps a
new class which contains both the ExpressionSet and the
oligoSnpSet?  Other ideas on how to model this type of dataset?
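
One way to picture the "new class which contains both" idea is the rough sketch below. It is written in Python purely for illustration and does not use the oligo or Biobase APIs; `AssayTable`, `SnpExprSet`, and all probe, SNP, and sample names are hypothetical. It assumes calls and copy number cover the same SNPs and samples, while the expression data may cover a different set of samples.

```python
from dataclasses import dataclass


@dataclass
class AssayTable:
    """A features-by-samples table with named rows and columns."""
    feature_names: list
    sample_names: list
    values: list  # values[i][j] is feature i measured in sample j

    def __post_init__(self):
        # Reject tables whose shape does not match the supplied names.
        if len(self.values) != len(self.feature_names):
            raise ValueError("expected one row per feature")
        if any(len(row) != len(self.sample_names) for row in self.values):
            raise ValueError("expected one column per sample")

    def column(self, sample):
        """Return {feature name: value} for a single sample."""
        j = self.sample_names.index(sample)
        return {f: row[j] for f, row in zip(self.feature_names, self.values)}


@dataclass
class SnpExprSet:
    """Hypothetical container pairing expression data with SNP calls and copy number."""
    expression: AssayTable
    calls: AssayTable
    copy_number: AssayTable

    def __post_init__(self):
        # Assumption: calls and copy number describe the same SNPs and samples.
        if self.calls.feature_names != self.copy_number.feature_names:
            raise ValueError("calls and copy number must cover the same SNPs")
        if self.calls.sample_names != self.copy_number.sample_names:
            raise ValueError("calls and copy number must cover the same samples")

    def shared_samples(self):
        """Samples present on both the expression side and the SNP side."""
        snp_samples = set(self.calls.sample_names)
        return [s for s in self.expression.sample_names if s in snp_samples]

    def sample_view(self, sample):
        """Everything recorded for one sample, grouped by data type."""
        return {
            "expression": self.expression.column(sample),
            "calls": self.calls.column(sample),
            "copy_number": self.copy_number.column(sample),
        }


# Made-up toy data: two probes, two SNPs, partially overlapping samples.
expr = AssayTable(["probe_1", "probe_2"], ["s1", "s2", "s3"],
                  [[7.1, 6.8, 7.4], [3.2, 3.9, 3.5]])
calls = AssayTable(["snp_1", "snp_2"], ["s1", "s2"], [["AA", "AB"], ["BB", "BB"]])
copies = AssayTable(["snp_1", "snp_2"], ["s1", "s2"], [[2.0, 2.1], [1.9, 3.0]])
combined = SnpExprSet(expr, calls, copies)
```

Here `combined.shared_samples()` lists the samples that can be linked across both data types, and `combined.sample_view("s2")` gathers the expression values, calls, and copy numbers for one of them.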

Thanks
-J

Sean Davis

Tue, Feb 6, 2007 10:35 AM

On Tuesday 06 February 2007 12:57, Jeff Gentry wrote:

> Hello ...
>
> I'll put a disclaimer up front that I'm extremely new to the world of SNP
> data, so some of this might be overly naive.  I've been looking at the
> oligo package with an eye toward using it to represent datasets involving
> SNP and expression data over the same samples.
>
> As background:  We have an application whose main purpose is to store
> multiple gene expression datasets in a database and to provide a web front
> end that lets users select probes/samples across multiple datasets fairly
> quickly using generic queries ("all probes involved w/ the apoptosis
> pathway", "any sample that is ER+", etc).  The database is populated using
> ExpressionSet objects, which were also the model used for designing the DB
> tables.
>
> What we'd like to do now is to allow SNP data to reside in this database as
> well, and then provide ways to interact with it.  The dataset that started
> this ball rolling has both SNP and expression data for the same samples,
> and the investigators would like to be able to tie this information
> together - so beyond just having SNP and expression data supported, we'd
> also like to provide mechanisms for linking these.
>
> On the SNP side of the data, at least at the moment, we'd like to be able
> to represent Affy call information as well as copy number or an intensity
> value.
>
> To get the ball rolling, I took a look at the oligo package to get a sense
> for what containers it currently had, and how they worked.  There were
> four in particular that caught my eye:
>
> - The SnpCallSet:  Looks to essentially be an eSet object, but
> replacing the expression matrix with a matrix of the calls
>
> - The SnpCopyNumberSet: Same, but with copy #
>
> - oligoSnpSet: A container which would hold both calls & copy # (correct?)
>
> - SnpQSet: This I'm not sure what it represents, but is the output of the
> snprma() functionality
>
> For starters, I was looking for confirmation that the above information is
> actually correct (or not) :)  After that, I'm looking to start moving
> towards some form of unified container -> it looks like this oligoSnpSet
> would hold the information we desire on the SNP side, and then perhaps a
> new class which contains both the ExpressionSet and the
> oligoSnpSet?  Other ideas on how to model this type of dataset?

I've thought about this a bit, but have never settled on a general framework 
for solving the problem.  The same issues come up with mapping between 
methylation data, ChIP-chip data, SNP data, CGH data, expression data, and 
others that most folks don't want to think about.  What I've come closest to 
settling on is a "mapper object" that sits between the two classes 
representing the different datatypes.  The "mapper object" gives a mapping 
between samples and features in the two classes, as they are likely to be 
many-to-many in general, particularly on the feature side.  This "mapper 
object" could be pretty simple, perhaps as simple as 2*(n-1) dataframes 
(where n is the number of mapped classes)--one set for mapping features and 
one for mapping samples, each based on the featureNames and sampleNames, 
respectively.  An initialize method would simply check the integrity of the 
supplied mappings against the supplied classes.  There would be some API 
issues to work out, particularly if there are more than 2 classes (SNP, CGH, 
expression, for example) involved.  Thoughts?
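
A minimal sketch of that idea for the two-class case, written in Python only for illustration (it is not Bioconductor code, and the `Mapper` class and every identifier in the toy data are made up): each mapping is a table of name pairs, and construction checks that every pair refers to names that actually exist in the two datasets.

```python
class Mapper:
    """Hypothetical many-to-many mapping between the names of two datasets."""

    def __init__(self, left_names, right_names, pairs):
        left, right = set(left_names), set(right_names)
        # Integrity check: every pair must refer to names known on both sides.
        bad = [(a, b) for a, b in pairs if a not in left or b not in right]
        if bad:
            raise ValueError(f"pairs reference unknown names: {bad}")
        self.pairs = list(pairs)

    def to_right(self, name):
        """All right-hand names linked to a left-hand name."""
        return [b for a, b in self.pairs if a == name]

    def to_left(self, name):
        """All left-hand names linked to a right-hand name."""
        return [a for a, b in self.pairs if b == name]


# Toy names standing in for featureNames and sampleNames of two datasets.
snp_features = ["snp_1", "snp_2"]
expr_features = ["probe_1", "probe_2", "probe_3"]
snp_samples = ["tumor_01_snp", "tumor_02_snp"]
expr_samples = ["tumor_01_expr", "tumor_02_expr"]

# With n = 2 classes this gives 2 * (n - 1) = 2 tables: features and samples.
feature_map = Mapper(snp_features, expr_features,
                     [("snp_1", "probe_1"), ("snp_1", "probe_2"),
                      ("snp_2", "probe_2")])
sample_map = Mapper(snp_samples, expr_samples,
                    [("tumor_01_snp", "tumor_01_expr"),
                     ("tumor_02_snp", "tumor_02_expr")])
```

In this toy setup `feature_map.to_right("snp_1")` gives the expression probes linked to a SNP, and `feature_map.to_left("probe_2")` goes the other way. How the pairs should be chained when more than two classes are involved is left open here.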

Sean

Benilton Carvalho

Tue, Feb 6, 2007 10:41 AM

Hi Jeff,

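> - The SnpCallSet:  Looks to essentially be an eSet object, but
> replacing the expression matrix with a matrix of the calls

The SnpCallSet also has a container for the confidence associated with
the call.

> - The SnpCopyNumberSet: Same, but with copy #

Same as above (a container for confidence).

> - oligoSnpSet: A container which would hold both calls & copy # (correct?)

Correct.

> - SnpQSet: This I'm not sure what it represents, but is the output of the
> snprma() functionality

The SnpQSet (the "Q" stands for "quantification") contains the
summaries for the SNPs. The approach that we use summarizes each
featureset to 4 numbers (alleles A and B / strands sense and antisense).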
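
To make the "4 numbers" concrete, here is a small illustrative sketch in Python. It is not oligo code: the function name, the input layout, and the intensities are made up, and the median is only a placeholder for whatever summarization snprma() actually performs.

```python
from statistics import median


def summarize_featureset(probes):
    """Reduce one featureset to four numbers, one per allele/strand combination.

    `probes` is a list of (allele, strand, intensity) tuples. The median is a
    stand-in summary chosen for this sketch, not the method used by snprma().
    """
    groups = {(a, s): [] for a in ("A", "B") for s in ("sense", "antisense")}
    for allele, strand, intensity in probes:
        groups[(allele, strand)].append(intensity)
    # A combination with no probes is reported as None rather than guessed.
    return {key: (median(vals) if vals else None) for key, vals in groups.items()}


# Made-up intensities for a single SNP's featureset.
example_probes = [
    ("A", "sense", 10.0), ("A", "sense", 12.0),
    ("A", "antisense", 9.0),
    ("B", "sense", 4.0), ("B", "sense", 5.0), ("B", "sense", 6.0),
    ("B", "antisense", 3.5),
]
summary = summarize_featureset(example_probes)
```

The result always has exactly four entries, keyed by `("A", "sense")`, `("A", "antisense")`, `("B", "sense")`, and `("B", "antisense")`.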

Does that help you?

cheers,
b