[Bioc-devel] eset.Rnw revised in Biobase, please review

Hi Kasper,
Hi Vince and others

Below are my first thoughts about the eSet class. I must say that I
like small, "tight" classes with strong validity checking.

I will start with some specific comments:

1) The history slot: a reasonable idea. But if we have a specific
history slot, shouldn't it be filled automatically every time an eSet
is created or modified? That is, every replacement function or
initialization should update this slot. Otherwise I do not really see
the need to keep this slot separate from the notes.

I doubt that such a comprehensive approach will be useful, especially
since we do not yet have a markup, or an intended mechanism for
displaying or managing the history. I suspect that, at least initially,
less is going to be more helpful. Perhaps tracking changes to the
expressions, or to a few other slots, would be a good first cut.
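
To make the "first cut" concrete, here is a minimal sketch of a replacement method that appends one history entry each time the expressions are replaced. The class toySet, its slots and the wording of the entry are hypothetical stand-ins, not the Biobase definitions.

```r
library(methods)

# Hypothetical toy class, not the Biobase eSet definition.
setClass("toySet",
         representation(exprs = "matrix", history = "character"))

setGeneric("exprs<-", function(object, value) standardGeneric("exprs<-"))

# Replacement method: store the new matrix and append one history entry.
setReplaceMethod("exprs", "toySet", function(object, value) {
    object@exprs <- value
    entry <- sprintf("exprs replaced (%d x %d)", nrow(value), ncol(value))
    object@history <- c(object@history, entry)
    object
})

x <- new("toySet", exprs = matrix(0, 2, 2))
exprs(x) <- matrix(1, 3, 2)
x@history
```

Only the one replacement method touches the history here; other slots would need their own methods if they were to be tracked as well.
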
2) The dim method: since it is part of your validity checking that  
every component of the assayData slot has the same dimensions, there  
is no need to have the dim be a matrix (every column will by  
definition be the same). You need an internal method to extract the  
matrix of dimensions, in order to do the validity checking of course...

Vince answered this - we are not yet sure that they would be, and 
would appreciate examples where they are not.
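
For reference, a small sketch of the "matrix of dimensions" being discussed, assuming the assay data is a plain named list of matrices. The element names and the variables dimMatrix and sameDims are illustrative only.

```r
# Hypothetical assay data: a named list of matrices.
assays <- list(green = matrix(0, nrow = 5, ncol = 3),
               red   = matrix(0, nrow = 5, ncol = 3))

# One column of dimensions per assay data element.
dimMatrix <- sapply(assays, dim)
rownames(dimMatrix) <- c("reporters", "samples")

# TRUE when every element has the same dimensions as the first one.
sameDims <- all(dimMatrix == dimMatrix[, 1])
```

If elements with differing dimensions turn out to be legitimate, dimMatrix still reports them, whereas a single dim vector could not.
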
3) I like the idea of having reporterNames separate from the assayData.
That also means that the names do not need to be unique. But should
sampleNames be a separate slot or just be the rownames of the
phenoData slot? There should be some kind of checking that the length
of these names is either 0 (no names given) or equal to the number of
samples/reporters.

I think that these should be checked in many different ways. Anywhere
they can be assigned they should be scrutinized, and if present we
should check that they are the same, and in the same order, as those in
the phenoData (whether row names on the data.frame or in a special
slot).
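
A rough sketch of the kind of check meant here, written as a stand-alone function that returns TRUE or a message in the style of an S4 validity method. The helper name checkSampleNames and its arguments are hypothetical.

```r
# Hypothetical helper: names must be absent (length 0) or match the
# phenoData row names exactly, including their order.
checkSampleNames <- function(sampleNames, pheno, nSamples) {
    if (length(sampleNames) == 0L)
        return(TRUE)
    if (length(sampleNames) != nSamples)
        return("sampleNames length differs from the number of samples")
    if (!identical(as.character(sampleNames), rownames(pheno)))
        return("sampleNames differ from phenoData row names or their order")
    TRUE
}

pheno <- data.frame(age = c(31, 45, 52), row.names = c("s1", "s2", "s3"))
checkSampleNames(c("s1", "s2", "s3"), pheno, 3)   # matching names
checkSampleNames(c("s2", "s1", "s3"), pheno, 3)   # same names, wrong order
```
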
4) I think the class of reporterInfo (data.frameOrNULL) is a bit too
strict. You give a compelling reason that we might want to give a
control/active factor. Now, since the number of reporters is huge,
this slot will (if not empty) be a very big structure, so I think we
really want to allow a very specific usage of this kind of slot
(data.frames are not terribly efficient). I would like the option of
having it be either a factor, an integer or a matrix. A possible use
scenario (which I strongly advocate) would be the use of an integer
to indicate (x,y) position on the chip for AffyBatch-like objects
(right now the map between row and (x,y) position in the AffyBatch
object is implicit, which does not allow for subsetting of the object,
since that would break the link).

I don't see the inefficiencies you are mentioning. A data.frame is
merely a list of vectors, and since I don't think we will solve all
problems with a single vector of reporterInfo, a data.frame is the
natural data structure. If you have some other data indicating specific
inefficiencies, please provide it. Your example, and others, are what we
had in mind.
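
As an illustration of the point that a data.frame is a list of vectors, here is a small sketch with explicit chip positions and a control/active factor. The column names x, y and type are made up for the example.

```r
# Hypothetical reporter annotation: chip position plus a control/active factor.
reporterInfo <- data.frame(
    x    = c(1L, 2L, 1L, 2L),
    y    = c(1L, 1L, 2L, 2L),
    type = factor(c("control", "active", "active", "control"))
)

# A data.frame is a list of equal-length column vectors.
is.list(reporterInfo)
sapply(reporterInfo, class)

# Subsetting rows carries the (x, y) position along with each reporter,
# so the link to the chip layout is not broken.
active <- reporterInfo[reporterInfo$type == "active", ]
```
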
Also, if someone wants to do splitting of the assayData based on a
factor, it may be _way_ more efficient to have the split done once
and for all (I imagine assayDataControl, assayDataActive) (something
which btw is not really doable in the current setup, since the two
structures would have different dimensions), instead of using a
factor to do the split "every time". Hmm. I haven't really thought this
through.

Not sure what you are worried about here, but we do envisage some
general uses of splitting parts, or all, of eSets via different
variables that are being made available. Again, it is probably best to
see what the real usage patterns are before we commit to the
implementation.
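
For concreteness, a sketch of splitting the rows of one expression matrix by a reporter-level factor, computing the index split once and reusing it. The matrix, the factor and the names idx and parts are invented for the example.

```r
exprs <- matrix(1:12, nrow = 4,
                dimnames = list(paste0("r", 1:4), paste0("s", 1:3)))
type <- factor(c("control", "active", "active", "control"))

# Split the row indices once; the result can be reused for every
# matrix that shares the same reporters.
idx <- split(seq_len(nrow(exprs)), type)
parts <- lapply(idx, function(i) exprs[i, , drop = FALSE])
```

The resulting pieces have different numbers of rows, which is why they could not sit side by side in one assayData if all elements must share dimensions.
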
5) I am not really in favour of the varMetadata slot of the phenoData
class, although the vignette seems to indicate that this was included
in Bioc 1.6. The only example you include is the specification of
units, something I feel belongs in the varLabels slot, such as
"specimen age, in years". As I currently understand it, I feel this
is a bit too much annotation. The same goes for a hypothetical
reporterMetadata slot. Perhaps you have another usage in mind? There
does not seem to be validity checking of this slot?

I don't see how you could ever realistically parse a label and get
back what you want (or even know, in some programmatic way, that there
is valuable information there); your experience may be different.
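
A small sketch of the difference, assuming the metadata is kept as a table with one row per phenotype variable. The column names label and unit, and the values shown, are illustrative and not taken from the vignette.

```r
pheno <- data.frame(age = c(31, 45), dose = c(0.5, 1.0))

# Hypothetical metadata table: one row per phenotype variable.
varMetadata <- data.frame(
    label = c("specimen age", "treatment dose"),
    unit  = c("years", "mg"),
    row.names = names(pheno),
    stringsAsFactors = FALSE
)

# Programmatic lookup, with no string parsing of a free-text label.
varMetadata["age", "unit"]
```
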
6) The assayData slot: I do not really understand the
pass-by-reference comments you make in the vignette, but they seem to
indicate that there would be performance gains from using an
environment. Could you explain this in some more detail? And if there
are, I see no reason to allow a list type structure. I think it should
be mandatory to have either a list or an environment; allowing both
just adds confusion. I would rather have the community choose the
most efficient way and then "force" developers to use this.

We try not to force much of anything onto developers. Lists and
environments are essentially equivalent here, and there is probably no
need to impose one or the other. Users/developers need to store things
together and to access them by name - lists and environments both
provide that capability. If you, or someone else, wants to do some
careful time and space comparisons, we would certainly take that under
advisement, but for now, we think we have the resources to get this new
data structure in place for the next release.
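
To illustrate both halves of this (access by name works the same way, but only environments have reference semantics), here is a short sketch in base R. The function bump and the variable names are made up for the example.

```r
m <- matrix(0, 2, 2)

asList <- list(exprs = m)
asEnv  <- new.env()
assign("exprs", m, envir = asEnv)

# Both support access by name.
identical(asList[["exprs"]], asEnv[["exprs"]])

# A function that modifies its argument: the list is modified as a local
# copy, while the environment is shared with the caller.
bump <- function(store) {
    store[["exprs"]][1, 1] <- 1
    invisible(NULL)
}
bump(asList)
bump(asEnv)

asList[["exprs"]][1, 1]   # unchanged in the caller
asEnv[["exprs"]][1, 1]    # changed in the caller
```

This says nothing about actual time or space costs; those would still need the careful comparisons mentioned above.
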
7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (let us say a
two-color microarray experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
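   setClass("twocolor", representation("eSet"),
      validity = function(object){
         ## validObject(test = TRUE) returns TRUE or a character message
         if(!isTRUE(validObject(as(object, "eSet"), test = TRUE)))
            return(FALSE)  ## this might be unnecessary
         ## names must be exactly "green" and "red", in any order
         if(!identical(sort(names(assayData(object))), c("green", "red")))
            return(FALSE)
         else
           return(TRUE)
       })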
or how do users actually make sure that the elements of the assayData
have the relevant names (and numbers)?

That would be one use. Martin already pointed out one set of
problems; let me suggest that the need to sort seems wrong, as does the
notion that only red and green are valid names (%in%, toupper, and a
few other functions might make any user of such a class much happier).
You probably also want to run the eSet validity checker.
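
A sketch of what a more forgiving check might look like, using %in% and toupper as suggested. To keep it self-contained it uses a hypothetical stand-in class toyESet rather than the real eSet, and the required names are an assumption of the example.

```r
library(methods)

# Hypothetical stand-in for eSet, so the sketch runs without Biobase.
setClass("toyESet", representation(assayData = "list"))

setClass("toyTwoColor", contains = "toyESet",
         validity = function(object) {
             required <- c("GREEN", "RED")
             present  <- toupper(names(object@assayData))
             absent   <- required[!required %in% present]
             if (length(absent) > 0L)
                 return(paste("missing assayData element(s):",
                              paste(absent, collapse = ", ")))
             TRUE
         })

# Case and order do not matter, and extra elements would be allowed.
ok <- new("toyTwoColor",
          assayData = list(Red = matrix(0, 2, 2), green = matrix(0, 2, 2)))

# A missing channel makes new() fail with the validity message.
bad <- tryCatch(new("toyTwoColor", assayData = list(red = matrix(0, 2, 2))),
                error = function(e) "invalid")
```

validObject also runs the validity methods of the superclasses, so the parent class check does not need to be called by hand inside the subclass method.
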

   Thanks again for all the comments,
     Robert
Kasper

On Sep 2, 2005, at 9:26 AM, Vincent Carey 525-2265 wrote:

We need discussion of the eSet class, which is to take the place
of exprSet in the future.  eset.Rnw in Biobase/inst/doc has
been revised.  Please review and discuss.

you will need R 2.2 and the latest Biobase to build this vignette.

vc

_______________________________________________
Bioc-devel at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
