[Bioc-devel] Couple of eSet questions

10 messages · Sean Davis, Seth Falcon, Vincent Carey +1 more

#
Since we are talking about eSets, I thought I would ask a couple more
questions.

1)  What is the thinking as to "standard use" for two-color data and eSets?

2)  With regard to reporterInfo, is there going to be a constraint to keep
the single-column data.frame idea as presented in the vignette, or is that
slot meant to be more flexible (left as a generic data.frame)?

I ask here because I get the sense that these things are developing, still.
If I am wrong, I can certainly redirect to the Bioc list.

Thanks,
Sean
#
On 2 Feb 2006, sdavis2 at mail.nih.gov wrote:
This is a good forum for the discussion.
I would like to see the reporterInfo be an instance of a class much
like the phenoData in that it is data.frame like but also has some
description of what is in it.  

So to me, the phenoData class might be better named AnnotatedDataFrame
and then both the phenoData and reporterInfo slots can have one of
these.

+ seth
#
More thoughts on eSet...
On 2 Feb 2006, sdavis2 at mail.nih.gov wrote:
That's a good question.  For me it points to a downside of the notion
of an "everything set".  In its current form, "standard use" for
two-color data would require a convention for how to name the
assayData elements.

I think the notion of grouping together assay data that shares
reporterInfo and phenoData is a good one, but I wonder if the actual
data storage/access is best handled by technology specific subclasses.

Here are three usage scenarios:

1. Two-color data.  Want to store two expression matrices for red and
   green scans along with an associated standard error matrix for
   each.  

2. Time-course data.  Same samples, same chip.  Want to store
   timepoint expression matrix pairs.

3. Combine 1 & 2.

While I can imagine ways to use eSet for all three, they would all require
ad-hoc naming schemes for the elements of the assayData slot.  Here,
there may be a real advantage to a technology-specific subclass that
defines an appropriate structure (redExprs, greenExprs, etc).
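
To make the contrast concrete, here is a minimal sketch in Python (not R/S4, and not Biobase code). The class names `GenericSet` and `TwoColorSet` and all field names are hypothetical, and matrices are plain lists of rows. It only illustrates how a free-form container leaves element names to convention, while a technology-specific class states the structure up front.

```python
from dataclasses import dataclass, field


# Hypothetical illustration only: neither class is Biobase's actual eSet.
@dataclass
class GenericSet:
    """Everything-set style: assay matrices live in a free-form dict."""
    assay_data: dict = field(default_factory=dict)


@dataclass
class TwoColorSet:
    """Technology-specific style: the structure is part of the class."""
    red_exprs: list
    green_exprs: list
    red_se: list
    green_se: list

    def __post_init__(self):
        # All four matrices must share the same (rows, columns) shape.
        matrices = (self.red_exprs, self.green_exprs, self.red_se, self.green_se)
        shapes = {(len(m), len(m[0])) for m in matrices}
        if len(shapes) != 1:
            raise ValueError("all matrices must share dimensions")


# Two authors can pick different names for the same thing in the generic set.
generic = GenericSet({"R": [[1.0, 2.0]], "G": [[0.5, 1.5]]})
generic_other = GenericSet({"red": [[1.0, 2.0]], "green": [[0.5, 1.5]]})

# The subclass-style container leaves no room for naming disagreements.
two_color = TwoColorSet(
    red_exprs=[[1.0, 2.0]],
    green_exprs=[[0.5, 1.5]],
    red_se=[[0.1, 0.1]],
    green_se=[[0.2, 0.2]],
)
```

Code written against `generic` breaks on `generic_other` unless both authors follow the same naming convention, which is the ad-hoc scheme described above.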

Another angle is to search for methods that make sense across
different technologies:

  description()
  notes()
  annotation()
  history()?
  phenoData()
  reporterInfo()
  sampleNames()
  reporternames()

Seems to me that reporterInfo and phenoData slots should contain the
same class, a data.frame with some additional meta data.
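
A rough sketch of that idea, again in Python for illustration rather than R: one class that pairs tabular data with a description of each column, used for both slots. `AnnotatedTable` and its fields are hypothetical names, not an existing API, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTable:
    """Hypothetical stand-in for a data.frame plus column descriptions."""
    data: dict          # column name -> list of values, one per row
    var_metadata: dict  # column name -> human-readable description

    def __post_init__(self):
        # Every column must be described.
        missing = set(self.data) - set(self.var_metadata)
        if missing:
            raise ValueError(f"columns lacking a description: {sorted(missing)}")
        # Every column must have the same number of rows.
        lengths = {len(col) for col in self.data.values()}
        if len(lengths) > 1:
            raise ValueError("columns must have equal length")


# The same class serves sample-level and reporter-level annotation.
pheno_data = AnnotatedTable(
    data={"sex": ["F", "M"], "age": [34, 51]},
    var_metadata={"sex": "sex of the donor", "age": "age in years"},
)
reporter_info = AnnotatedTable(
    data={"symbol": ["geneA", "geneB", "geneC"]},
    var_metadata={"symbol": "gene symbol for the reporter"},
)
```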

+ seth
#
On 2/3/06 1:36 PM, "Seth Falcon" <sfalcon at fhcrc.org> wrote:

            
What about a "channel" data structure which would include the label used
(Cy3, Cy5, etc.), of which there can be 1..2?

I still think there needs to be an accepted data matrix and associated error
matrix for the whole dataset, though.  This would represent the normalized
and processed data to be used in any further analysis.

Time course data needs to be combined with metadata in order to be useful.
I'm not sure that two-color data and time-course data are analogous from
that point of view, but I may just be thinking about it slightly
differently.

Sean
#
On 3 Feb 2006, sdavis2 at mail.nih.gov wrote:
I was making up the specifics and it shows ;-)
Where I think we agree is that technology-specific data structures
might be best represented as part of the class structure, not left to
an ad-hoc convention for how to use list names.
Right, this was supposed to be not-analogous apart from wanting to
have a single place to put the metadata.  

+ seth
#
it sounds to me as if we are finding some use cases, and that
we want to extend eSet to cope with those that have well-defined
requirements.  all we know about assayData at present is that
it is either a list or an environment.  for two-channel data
we may want to define a specific extension of listOrEnv (i think
this is possible) that has guaranteed structure and names.
for data with standard errors we might have to do likewise.

but i would say that part of the intention of the eSet is to
require the developer always to allow an environment representation
for the assayData.  the validity criteria can impose restrictions
on this environment.

eSet itself does not need to solve the two channel or error-available
problems at once.  it should be extended to do so, with explicit
use cases stated.
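
a minimal sketch of the validity-criteria approach, in Python for illustration only. the element names `red` and `green` and the function `validate_two_channel` are assumptions made up for this example; a dict stands in for the environment.

```python
# Hypothetical required element names for a two-channel assayData.
REQUIRED = ("red", "green")


def validate_two_channel(assay_data):
    """Return a list of problems; an empty list means the data is valid."""
    problems = []
    for name in REQUIRED:
        if name not in assay_data:
            problems.append(f"missing element: {name}")
    # Compare (rows, columns) of whichever required matrices are present.
    present = [assay_data[name] for name in REQUIRED if name in assay_data]
    shapes = {(len(m), len(m[0]) if m else 0) for m in present}
    if len(shapes) > 1:
        problems.append("elements differ in dimensions")
    return problems


good = {"red": [[1.0, 2.0]], "green": [[0.5, 1.5]]}
bad = {"red": [[1.0, 2.0]]}
```

here the container stays generic and the restrictions live entirely in the check, so the required structure is only visible by reading the validity function.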

---
Vince Carey, PhD
Assoc. Prof Med (Biostatistics)
Harvard Medical School
Channing Laboratory - ph 6175252265 fa 6177311541
181 Longwood Ave Boston MA 02115 USA
stvjc at channing.harvard.edu
On Fri, 3 Feb 2006, Seth Falcon wrote:

            
#
On 3 Feb 2006, stvjc at channing.harvard.edu wrote:
Respectfully, I think I disagree.  I would like to have the use cases
drive the design of specific subclasses of an eSet-like class where
the structure is expressed as part of the class definition
(e.g. exprSet should have an exprs slot, not a named element of an
env).

Pushing the definition of the structure to the validity function makes
the actual structure harder to see (IMO) and I'm concerned that it
will make extensions that would otherwise be trivial subclasses
tricky.  I guess a part of my objection is that it feels as though we
will be implementing our own mini class system where slots are the
named elements of an env.

A compelling argument for forcing the actual data to be in an
environment is to avoid copying.
Yes, I'm just not convinced that eSet has any business having actual
data slots; those are the domain of its subclasses.

Putting aside my perhaps ideological objections, maybe a compromise is
to work on some of the concrete subclasses (two-channel data being one
good example) and factor out the common elements as they evolve.  

+ seth
#
I think the use case should be used to describe what behaviors
we want and then the representation can be chosen to allow
those behaviors.

The internal representation should be subordinate
to the methods that expose the structure through its behaviors.
Now perhaps my comment on requiring the developer to allow for
an environment is inconsistent with this position.
I am open to discussion of how the visibility of a structure needs
to be cared for in our project.  If it is only visible through
methods, it shouldn't matter whether the information components are
slots or environment elements.  And the class designers should
be free to change the internal representations without downstream
consequences.  We have not achieved this in several domains --
should we be trying harder?
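
A small sketch of that point, in Python for illustration only. The classes `SlotBacked` and `EnvBacked` and the helper `mean_expression` are hypothetical. Downstream code that goes only through the accessor cannot tell which representation is underneath.

```python
class SlotBacked:
    """Stores the expression matrix directly, like a dedicated slot."""

    def __init__(self, exprs):
        self._exprs = exprs

    def exprs(self):
        return self._exprs


class EnvBacked:
    """Stores the matrix as a named element of a dict, like an environment."""

    def __init__(self, exprs):
        self._assay_data = {"exprs": exprs}

    def exprs(self):
        return self._assay_data["exprs"]


def mean_expression(eset):
    """Downstream code: uses only the exprs() accessor."""
    values = [v for row in eset.exprs() for v in row]
    return sum(values) / len(values)


matrix = [[1.0, 2.0], [3.0, 4.0]]
slot_result = mean_expression(SlotBacked(matrix))
env_result = mean_expression(EnvBacked(matrix))
```

Both calls give the same result, so the class designer could switch representations without touching `mean_expression`.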
That's the intention -- but not to force people to use environments,
but to allow them to do so when it makes sense to do so.
It would be good to get some agreement on this.  I would say that
the eSet schematizes high throughput assay data.  We hold, to
some benefit, that there needs to be an assayData component, and that
it has reporterInfo and phenoData by virtue of its extension of
annotatedDataset.  Less than this and we are not solving the
high-throughput problem.  It has some other slots that I am not
so sure about that seem to be there for continuity with the
previous incarnation.
i agree.  i don't mean to be very pedantic about the representation/
method access concepts ... i have to run now.
2 days later
#
On 2/3/06 4:05 PM, "Vincent Carey 525-2265" <stvjc at channing.harvard.edu>
wrote:
I'm just curious as to when and by whom eSets will be subclassed?  Will that
be left to individual package developers, or will these subclasses be
available in Biobase for general consumption?

Sean
#
Sean Davis wrote:
Ideally there would be platform/experiment-specific subclasses. 
Initially I would expect them all to be different, but over time for a 
consensus to appear. For example, we would be really supportive of some 
move on the arrayCGH front that went for a standardized data format (and 
some form of eSet could be used). For flow cytometry the prada package 
has already done something along these lines and we make some use of it 
in rflowcyt.

   I doubt it would all go in Biobase, but rather that there might be an 
aCGHBase and a flowBase, etc., that would help to accommodate different groups.

   Whether that happens depends a lot on those folks. We try pretty hard 
not to enforce any particular style but rather to show that there are 
benefits from using common formats and to having the data be as 
self-describing as possible.

Hopefully, once we make some progress on the microarray experiment 
repository more people will see the benefits of this approach, but maybe 
not.

  best wishes
    Robert