Skip to content

[Bioc-devel] Biobase / eSet changes for this release

10 messages · Martin Morgan, Rafael A Irizarry, Rafael A. Irizarry +2 more

#
Biobase/eSet developers,

Here is a brief summary of the version of eSet to be included in the
this release; the code builds and checks without error, though missing
documentation (to be corrected within the week) mean that there are
still warning messages during check.  The most recent changes are in
svn.

There is one very recent change, to the overall class structure, that
we agonized over a great deal before making at the last moment.  We
recognize that this is very unfortunate timing, and that it will cause
needless work for bioconductors; we will help out as much as possible.


There are three major changes:

1. Change in class structure.

eSet -- VIRTUAL
  ExpressionSet
  SnpSet
  (TilingSet -- not implemented)

The main functionality of eSet is to coordinate assayData, phenoData,
experimentData, and the annoation.  eSet is also a generalized
container, with high-throughput data stored in the assayData
slot. eSet is a VIRTUAL class; if you want to store and manipulate a
consistent set of elements in the assay data slot you should create a
subclass of eSet. An example of how to do this is below.

ExpressionSet requires that the assayData slot contain matrix element
'exprs'; other elements (of dimension identical to exprs) are
permitted. as(exprSet, "ExpressionSet") coerces exprSet objects to
ExpressionSet, perhaps issuing warnings if ambiguities arise.

library(Biobase)
data(sample.exprSet)
obj <- as(sample.exprSet, "ExpressionSet")
obj

SnpSet is meant to contain SNP data in a manner analogous to
ExpressionSet; 'call' and 'callProbability' are required assayData
elements providing information on the call and a statement of
confidence in the call. The exact structure of these matricies is not
specified, but the idea is that 'call' encodes diploid genotypes.

2. Change in assayData storage

The assayData slot is an AssayData class union of 'list' and
'environment'; as a class union, there is no 'initialize'
method. Instead, the list or environment can be populated with
elements using a call to assayDataNew(...).

An innovation is the storageMode method, which can be used to change
how elements in assayData are stored. In particular the storageMode
can be 'lockedEnvironment', and indeed this is the default. An
environment is locked in the sense that new elements cannot be added
to the environment, and existing elements cannot be changed. This
means that the pass-by-reference semantics of environments will not
catch users off-guard:

obj <- as(sample.exprSet,"ExpressionSet") # default: lockedEnvironment
storageMode(obj) <- "environment"
obj1 <- obj
exprs(obj1) <- exprs(obj1)[1:10,1:5]
dims(obj) # yikes! obj exprs dimensions changed!

obj <- as(sample.exprSet,"ExpressionSet") # default: lockedEnvironment
storageMode(obj) <- "environment"
obj1 <- obj
exprs(obj1) <- log(exprs(obj))
identical(exprs(obj1),exprs(obj)) # TRUE: yikes again!

obj <- as(sample.exprSet,"ExpressionSet") # default: lockedEnvironment
obj1 <- obj
exprs(obj1) <- log(exprs(obj1))
identical(exprs(obj1),exprs(obj)) # FALSE: good!

Note that attempts to directly change slots in locked environments
cause an error
Error: cannot change value of a locked binding. 

The setReplaceMethod for exprs (and assayData) succeeds by performing
a deep copy of the entire environment. Becaue this is very
inefficient, the recommended paradigm to update an element in a
lockedEnvironment is to extract it, make many changes, and then
reassign it, e.g.,

ex <- exprs(obj1)
# many changes, ex <- log(ex), ...
exprs(obj1) <- ex

lockedEnvironment offers some efficiency in copying objects, because
the environment is not copied during function calls. This is not
completely satisfactory, though

func <- function(assayData) # good: contents of env will not be copied
  max(exprs(assayData)) # not so good: exprs copied from environment


3. Changes in other slots

Other slots have been changed to treat variable metadata more
efficiently (in the AnnotatedDataFrame class of slot phenoData) and to
simplify the type of data stored as experimentData. These changes are
mostly in line with the web discussions.



In making these changes, I have tried not to break the existing
interface beyond what is necessary for the new functionality (e.g.,
pData still returns the 'data' part of phenoData). One difference,
though, is that the methods dim, ncol, etc return a vector of
dimensions reflecting the shared dimensionality of the assayData
memebers; dims returns an array of dimensions of each element.

These changes affect eSets; any difficulties you might have with
exprSet probably reflect changes made several months ago to validity
checking.

Please let me know of any feedback,

Martin
--

The original 'sample.eSet' contains four elements in the assayData
slot: R, G, Rb, Gb. To derive a class from eSet for this data, create
a class, and provide initializaation and validation
methods. Optionally, update previous eSet data structures to your new
class. For instance,

setClass("SwirlSet", contains="eSet")

setMethod("initialize", "SwirlSet",
          function(.Object,
                   phenoData = new("AnnotatedDataFrame"),
                   experimentData = new("MIAME"),
                   annotation = character(),
                   R = new("matrix"),
                   G = new("matrix"),
                   Rb = new("matrix"),
                   Gb = new("matrix"),
                   ... ) {
            callNextMethod(.Object,
                           assayData = assayDataNew(
                             R=R, G=G, Rb=Rb, Gb=Gb,
                             ...),
                           phenoData = phenoData,
                           experimentData = experimentData,
                           annotation = annotation)
          })

setValidity("SwirlSet", function(object) {
  assayDataValidMembers(assayData(object), c("R", "G", "Rb", "Gb"))
})

data(sample.eSet)
obj <- updateOldESet(sample.eSet,"SwirlSet")
#
Hi!

Looks good.

One comment:
Typically, SnpSet will have more than just the allele calls and p-values.
It will aslo have a copy number estimate and it's measure of 
uncertainty. How hard is it to add these?
You would need one more array  just like the one you used for the 
genotype calls.

-r
Martin Morgan wrote:

            
#
Hi Rafael,

An approach is to create a new class derived from eSet with
initialization and validation methods to specify the additional
elements to be stored in assayData. For the additions you suggest, the
following might be a useful template:

setClass("SnpSetPlus", contains="eSet")

setMethod("initialize", "SnpSetPlus",
          function(.Object,
                   phenoData = new("AnnotatedDataFrame"),
                   experimentData = new("MIAME"),
                   annotation = character(),
                   call = new("matrix"),
                   callProbability = new("matrix"),
                   copyNumber = new("matrix"),
                   copyNumberProbability = new("matrix"),
                   ... ) {
            callNextMethod(.Object,
                           phenoData = phenoData,
                           experimentData = experimentData,
                           annotation = annotation,
                           call=call,
                           callProbability=callProbability,
                           copyNumber=copyNumber,
                           copyNumberProbability=copyNumberProbability,
                           ...)
          })

setValidity("SnpSetPlus", function(object) {
  assayDataValidMembers(assayData(object),
                        c("call","callProbability",
                          "copyNumber", "copyNumberProbability"))
})

obj <- new("SnpSetPlus")

Martin


Rafael A Irizarry <ririzarr at jhsph.edu> writes:
#
but most applications will have both genotype call and
copy number estimates. so why not just make SnpSet like SnpSetPlus below?

Notice that in our software i will only use SpnSetPlus and so will Rob.

-r
On Thu, 13 Apr 2006, Martin Morgan wrote:

            
4 days later
#
Martin,

Sorry for taking so long in responding.

We seem to be almost there but I have two concerns:

1) SNPset as it is will not be useful. As far as I know only Rob and me 
are  developing software that will use snpSet. Both of us need a slot 
for copynumber. Otherwise we will need to create a new class, which will 
be the only one used, and we wont get to use the name SnpSet.

2) It seems that in general we will be storing  an esimate (expression, 
calls, copynumber) and some kind of measure of uncertainty (SE for 
expression, p-value fo calls, etc..). However, a big chunk fo the apps 
will not use uncertainty. It would be a shame to have to store a matrix 
of NAs every single time. How hard would be to have eSet take NULL for 
some matrices? The validity check can look at everything that is not NULL.
Notice that the alternative is to define  a new class which means, in my 
case, Ill need two classes for every class Im defining or having a 
matrix of NAs, which, given the size of data these days, will be very 
inneficient.

im cc-ing ingo and rob who maintain SNPscan.

-r
Martin Morgan wrote:

            
#
Hi Rafa,

I'm going to answer out of order...

Rafael A Irizarry <ririzarr at jhsph.edu> writes:
I think we should define an EmptyMatrix class:

    setClass("EmptyMatrix", contains="matrix")

Why?  Because:

 * new("EmptyMatrix") is small (no elements) 
 * We can dispatch on it.  I think we might be able to get some
   propagation of empty similar to how NA works.  This can keep the
   code from having to do lots of explicit type checking.
 * NULL can happen by accident and should be an error.  EmptyMatrix
   won't just "appear" from a calculation.
 * It will also mean that these values could go in a proper slot of
   type "matrix" without having to create the matrixOrNull class
   union.
There is something to be said for reserving the general class names
for things that will be general.  Perhaps it makes sense to choose a
slightly more specific name while sorting out the all-purpose use
cases.

Your RafaSnpSet can become SnpSet.  But if you start with SnpSet and
go off in a direction that is not useful to others, then there is no
easy fix.



+ seth
#
Hi Seth,


I like the idea of "EmptyMatrix".
Another more pedestrian option would be "matrix" x with
   all(dim(x)==0)
such as e.g. the result of new("matrix").

The only (small) advantage is that new users/developers that know R but 
not bioc might find it less intimidating if our designs are not overly 
verbose.

I don't mind much, just a suggestion

Cheers
   Wolfgang
#
Wolfgang Huber <huber at ebi.ac.uk> writes:
That would work, but will always require the explicit check.  With
EmptyMatrix we can dispatch.  I'm not certain that will be useful, but
I think it could be...
Certainly a consideration.
#
I'll implement EmptyMatrix. I will then update validity checking (and
other methods, as necessary) to allow for it in place of any/all
elements in a class. This will take a couple of days to get around to,
and in the mean time I'll talk with Robert and Seth about what
elements are actually in SnpSet; in some ways the EmptyMatrix idea
makes a more flexible approach (specifying several required slots,
e.g., expression, call, copyNumber, but usually functioning correctly
if these are occupied by EmptyMatrix) appealing.

It might help for a two-second overview of how 'expression' data might
be generally useful; I can see how call and copyNumber will feed into
'downstream' analyses.

Martin

Seth Falcon <sfalcon at fhcrc.org> writes:
#
Martin and Seth,

Thanks for this!

Regarding SnpSet, all applications ive seen are either genotype calling, 
copy number estimation or both. so i do think that the "minimum" SnpSet 
should contain 4 matrices. As you point out, we can always leave some 
of the matrices empty. We could also define: subclasses, e.g., 
GenoTypeSnpSet and CopyNumberSnpSet.

Once you are done let us know and we will give you feedback within a day.

-r
On Tue, 18 Apr 2006, Martin Morgan wrote: