
[Bioc-devel] eset.Rnw revised in Biobase, please review

6 messages · Vincent Carey, Kasper Daniel Hansen, Martin Maechler +2 more

#
We need discussion of the eSet class, which is to take the place
of exprSet in the future.  eset.Rnw in Biobase/inst/doc has
been revised.  Please review and discuss.

You will need R 2.2 and the latest Biobase to build this vignette.

vc
3 days later
#
Hi Vince and others

Below are my first thoughts about the eSet class. I must say that I
like small, "tight" classes with strong validity checking.

I will start with some specific comments:

1) The history slot: a reasonable idea. But if we have a specific
history slot, shouldn't it be filled automatically every time an eSet
is created or modified? That is, every replacement function or
initialization should update this slot. Otherwise I do not really see
the need to keep this slot separate from the notes.
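A toy sketch of what "every replacement function updates the history slot" could look like. This is not Biobase code; the class and generic names are made up for illustration:

```r
library(methods)  # needed when run with Rscript

## "tracked" is a made-up class, not part of Biobase.
setClass("tracked",
         representation(value = "numeric", history = "character"))

## A replacement function that appends to the history slot on every change.
setGeneric("value<-", function(object, value) standardGeneric("value<-"))
setMethod("value<-", "tracked", function(object, value) {
  object@value   <- value
  object@history <- c(object@history,
                      paste("value replaced at", format(Sys.time())))
  object
})

obj <- new("tracked", value = 1, history = "created")
value(obj) <- 2
obj@history   # "created" plus one automatically recorded modification
```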

2) The dim method: since it is part of your validity checking that
every component of the assayData slot has the same dimensions, there
is no need for dim to return a matrix (every column would by
definition be the same). You do need an internal method to extract the
matrix of dimensions in order to do the validity checking, of course...
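A minimal sketch of that internal check; the helper names here are illustrative, not Biobase API:

```r
## Build the matrix of per-component dimensions, then require every
## column to match the first.
assayDims <- function(assayData) {
  sapply(assayData, dim)          # 2 x k matrix, one column per component
}
sameDims <- function(assayData) {
  d <- assayDims(assayData)
  all(d == d[, 1])                # each column equals the first
}

ad <- list(exprs = matrix(0, 4, 3), se.exprs = matrix(1, 4, 3))
sameDims(ad)                      # TRUE: both components are 4 x 3
ad$extra <- matrix(0, 2, 3)
sameDims(ad)                      # FALSE: a 2 x 3 component sneaks in
```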

3) I like the idea of having reporterNames separate from the assayData.
That also means that the names do not need to be unique. But should
sampleNames be a separate slot, or just the rownames of the
phenoData slot? There should be some kind of checking that the length
of these names is either 0 (no names given) or equal to the number of
samples/reporters.

4) I think the class of reporterInfo (data.frameOrNULL) is a bit too
strict. You give a compelling reason that we might want to give a
control/active factor. Now, since the number of reporters is huge,
this slot will (if not empty) be a very big structure, so I think we
really want to allow only very specific usage of this kind of slot
(data.frames are not terribly efficient). I would like the option of
having it be either a factor, an integer, or a matrix. A possible use
scenario (which I strongly advocate) would be the use of an integer
to indicate the (x, y) position on the chip for AffyBatch-like objects
(right now the map between row and (x, y) position in the AffyBatch
object is implicit, which does not allow for subsetting of the object,
since that would break the link).
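A hypothetical illustration of the (x, y) suggestion (the chip dimension and data here are made up): storing the position as a single integer index means it travels with the rows under subsetting, so the link cannot break.

```r
## Made-up example: a 2 x 2 chip with four reporters.
nrowChip  <- 2                                 # assumed chip dimension
reporterXY <- data.frame(x = c(1, 2, 1, 2),
                         y = c(1, 1, 2, 2))

## Encode (x, y) as one integer index, column-major.
pos <- (reporterXY$y - 1) * nrowChip + reporterXY$x

keep <- c(1, 4)                                # subset some reporters
pos[keep]                                      # positions stay linked to the rows
```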

Also, if someone wants to do splitting of the assayData based on a
factor, it may be _way_ more efficient to have the split done once
and for all (I imagine assayDataControl, assayDataActive), something
which, by the way, is not really doable in the current setup since the
two structures would have different dimensions, instead of using a
factor to do the split "every time". Hmm. I haven't really thought this
through.
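The "split every time" variant can be sketched in a few lines (illustrative names, not Biobase API), e.g. as a precursor to normalizing on control spots only:

```r
## Made-up assay matrix: four reporters by three samples.
exprs  <- matrix(1:12, nrow = 4)
status <- factor(c("control", "active", "control", "active"))

## Split the rows by the control/active factor each time it is needed.
byStatus <- lapply(split(seq_len(nrow(exprs)), status),
                   function(i) exprs[i, , drop = FALSE])

dim(byStatus$control)   # 2 x 3: only the control spots
```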

5) I am not really in favour of the varMetadata slot of the phenoData
class, although the vignette seems to indicate that this was included
in Bioc 1.6. The only example you include is the specification of
units, something I feel belongs in the varLabels slot, such as
"specimen age, in years". As I currently understand it, this is a bit
too much annotation. The same goes for a hypothetical reporterMetadata
slot. Perhaps you have another usage in mind? There does not seem to
be any validity checking of this slot?

6) The assayData slot: I do not really understand the pass-by-reference
comments you make in the vignette, but they seem to indicate that
there would be performance gains to using an environment. Could you
explain this in some more detail? And if there are, I see no reason to
also allow a list-type structure. I think it should be mandatory to
use either a list or an environment; allowing both just adds
confusion. I would rather have the community choose the most efficient
way and then "force" developers to use this.
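The pass-by-reference point can be demonstrated in a few lines: a list modified inside a function behaves as a copy, leaving the caller's list untouched, while an environment is shared, so changes made through it are visible to the caller.

```r
## Assigning into a list argument modifies a copy; the caller's list
## is untouched.
touchList <- function(x) { x$exprs[1, 1] <- 99; invisible(NULL) }
lst <- list(exprs = matrix(0, 2, 2))
touchList(lst)
lst$exprs[1, 1]          # still 0

## An environment is passed by reference: the update is shared.
touchEnv <- function(e) { e$exprs[1, 1] <- 99; invisible(NULL) }
env <- new.env()
env$exprs <- matrix(0, 2, 2)
touchEnv(env)
env$exprs[1, 1]          # 99
```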

7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (say, a two-color
microarray experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
   setClass("twoclor", representation("eSet"),
      validity = function(object){
         if(!validObject(as(object, "eSet")
            return(FALSE)  ## this might be unnecessary
         if(sort(names(assayData(object)) != c("green", "red"))
            return(FALSE)
         else
           return(TRUE)
       })
or how do users actually make sure that the elements of the assayData  
have the relevant names (and numbers)?

Kasper
On Sep 2, 2005, at 9:26 AM, Vincent Carey 525-2265 wrote:

#
1) RG is working on the history concept now, so I will pass on this.

2) good point.  i am hoping to hear from folks whether they can
imagine situations in which the assayData components may have
different dimensions.  in that case the validity check would have
to be relaxed.

3) i have vacillated on this aspect of metadata.  currently i believe
that rownames and colnames should be supplied and that the reporterNames
must come from there.  we now have the reporterData data.frame in there
(in annotatedDataset) that can ameliorate the problem of requiring unique
reporterNames.

4) is the data.frame as a container of a factor really an efficiency
loss?  we do need to think through the split use cases.  example: we would
like to make it easy for people to compute a normalization function
based strictly on control spots.

5) right, no validity checking yet.  you are right that such metadata
could be contained in labels, but how do you compute on those labels?
if you have a few datasets and need to make years and months variables
compatible, a convention on a units method may be helpful.  we have
one vote (private, a long time ago) in favor of the varMetadata approach and
now one against.

6) environments are not copied when passed to functions.  everything
else is, afaik.  why not require environments?  it is open for
additional discussion.

7) conceptually i think this is right.  we want to make sure the basic
infrastructure is not missing anything that you would want to have
in COMMON to all the different extensions that one can anticipate
for high throughput platforms.
#
.................

    Kasper> 7) So the assayData slot does not have a specific
    Kasper> number/names for its components. I see the need for
    Kasper> this. But let us say I want to use it for a specific
    Kasper> case where I have two assays (let us say a two-
    Kasper> color micro array experiment). Do you imagine that
    Kasper> people will create more specific versions of the
    Kasper> class by something like (code not tested)

(yes, it was missing 3 closing ")" 
  --- quickly seen when using Emacs with "paren match" activated)

    >>    setClass("twoclor", representation("eSet"),
    >>       validity = function(object){
    >>          if(!validObject(as(object, "eSet")))
    >>             return(FALSE)  ## this might be unnecessary
    >>          if(sort(names(assayData(object)) != c("green", "red")))
    >>             return(FALSE)
    >>          else
    >>            return(TRUE)
    >>        })

I want to comment on the above code, since
I think I've seen the same mistake several times in people's code:

Validity checking should  **NOT** return TRUE or FALSE,
but  TRUE or <reason for non-validity> .
This has been in `The Green Book', and is also the very first point in
   ?setValidity
or ?validObject.
Martin
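Martin's convention, shown with a self-contained toy class (deliberately not using Biobase, since eSet itself is what is under discussion): the validity function returns TRUE when the object is valid and a character string describing the problem otherwise, never FALSE.

```r
library(methods)  # needed when run with Rscript

## Toy class for illustration only.
setClass("xy", representation(x = "numeric", y = "numeric"),
  validity = function(object) {
    if (length(object@x) != length(object@y))
      return("x and y must have the same length")   # reason, not FALSE
    TRUE
  })

new("xy", x = c(1, 2), y = c(3, 4))            # passes validity
msg <- tryCatch(new("xy", x = 1, y = c(1, 2)),
                error = function(e) conditionMessage(e))
msg   # the reason string is reported in the error message
```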
#
Hi Kasper,
Kasper Daniel Hansen wrote:
1) I doubt that such a comprehensive approach will be useful, especially
since we do not yet have a markup or an intended mechanism for displaying
or managing the history. I suspect that, at least initially, less
is going to be more helpful. Perhaps tracking changes to the
expressions, or a few other slots, would be a good first cut.

2) Vince answered this - we are not yet sure that they would be, and
would appreciate examples where they are not.

3) I think that these should be checked in many different ways. Any
place that they can be assigned they should be scrutinized, and if
present we should check that they are the same, and in the same order, as
those in the phenoData (whether row names on the data.frame or in a
special slot).

4) I don't see the inefficiencies you are mentioning. A data.frame is
merely a list of vectors, and since I don't think we will solve all
problems with a single vector of reporterInfo, the data.frame is the
natural data structure. If you have some data indicating specific
inefficiencies, please provide it. Your example, and others, are what we
had in mind.

Not sure what you are worried about here, but we do envisage some
general uses of splitting parts, or all, of eSets via different variables
that are being made available. Again, it is probably best to see what
the real usage patterns are before we commit to the implementation.

5) I don't see how you could ever realistically parse a label and get
back what you want (or even know, in some programmatic way, that there is
valuable information there); your experience may be different.

6) We try not to force much of anything onto developers. Lists and
environments are essentially equivalent here, and there is probably no
need to impose one or the other. Users/developers need to store things
together and to access them by name - lists and environments both
provide that capability. If you, or someone else, wants to do some
careful time and space comparisons, we would certainly take that under
advisement, but for now, we think we have the resources to get this new
data structure in place for the next release.

7) That would be one use. Martin already pointed out one set of
problems; let me suggest that the need to sort seems wrong, as does the
notion that only "red" and "green" are valid names (%in%, toupper, and a
few other functions might make any user of such a class much happier).
You probably also want to run the eSet validity checker.

   Thanks again for all the comments,
     Robert

#
Hi,

I think the history idea is extremely important (and useful). We analyze a
lot of different microarray experiments. As a consequence, we often find
ourselves in the position of having to go back and figure out exactly how a
particular data set was analyzed when we are writing the methods section
for the resulting article or grant proposal. Our experience is that the
scripts that supposedly performed the analysis don't always match the
objects that were saved, and so we have started implementing a
comprehensive history mechanism that does update the history slot every
time the object is created or modified. You can take a look at the
implementation at
	http://bioinformatics.mdanderson.org/Software/OOMPA
To get this to work, however, we've had to turn processing functions into
objects that know how to update the history slot. The benefit is that you
can just ask the object what its history is (or even put it into the
summary method).

We're in the process of converting the various pieces of "expresso" into
"Processor" and "Pipeline" objects so we can handle Affymetrix arrays
seamlessly, but the code isn't yet in a form that is suitable for public
consumption.

Best,
	Kevin

--On Tuesday, September 06, 2005 7:19 AM -0700 Robert Gentleman
<rgentlem at fhcrc.org> wrote: