[Bioc-devel] eset.Rnw revised in Biobase, please review
Hi Kasper,
Kasper Daniel Hansen wrote:
Hi Vince and others, Below are my first thoughts about the eSet class. I must say that I like small "tight" classes with strong validity checking. I will start with some specific comments: 1) The history slot: a reasonable idea. But if we have a specific history slot, shouldn't it be filled automatically every time an eSet is created or modified? That is, every replacement function or initialization should update this slot. Otherwise I do not really see the need to keep this slot separate from the notes.
I doubt that such a comprehensive approach will be useful, especially since we do not yet have a markup, or an intended mechanism for displaying or managing the history. I suspect that, at least initially, less is going to be more helpful. Perhaps tracking changes to the expressions, or a few other slots, would be a good first cut.
2) The dim method: since it is part of your validity checking that every component of the assayData slot has the same dimensions, there is no need to have the dim be a matrix (every column will by definition be the same). You need an internal method to extract the matrix of dimensions, in order to do the validity checking of course...
Vince answered this - we are not yet sure that they would be, and would appreciate examples where they are not.
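To make the question concrete, here is a minimal sketch in base R of gathering the dimensions of every assayData element into a matrix and testing whether they agree. The helpers assay_dims and same_dims and the toy matrices are hypothetical illustrations, not Biobase code.

```r
## Hypothetical helpers, not part of Biobase: gather the dims of every
## element of a list of matrices, one column per element.
assay_dims <- function(assays) {
  vapply(assays, dim, integer(2))
}

## TRUE when every element has the same number of rows and columns.
same_dims <- function(assays) {
  d <- assay_dims(assays)
  all(d == d[, 1])
}

assays <- list(green = matrix(0, 4, 3), red = matrix(1, 4, 3))
assay_dims(assays)
same_dims(assays)

assays$ratio <- matrix(2, 4, 2)  # one element with fewer columns
same_dims(assays)
```

If the elements are always required to agree, the matrix collapses to a single column, which is the simplification suggested in point 2.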
3) I like the idea of having reportNames separate from the assayData. That also means that the names do not need to be unique. But should sampleNames be a separate slot or just be the rownames of the phenoData slot? There should be some kind of checking that the length of these names is either 0 (no names given) or equal to the number of samples/reporters.
I think that these should be checked in many different ways. Any place that they can be assigned they should be scrutinized and if present we should check that they are the same, and in the same order as those in the phenoData (whether row names on the dataframe or in a special slot).
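A minimal sketch of the kind of check described here, assuming the sample names arrive as a character vector and the phenotype data as a plain data.frame. The function check_sample_names is hypothetical and not part of Biobase.

```r
## Hypothetical checker, not Biobase code. 'sample_names' may be
## character(0) (no names given) or must match the phenoData rows.
check_sample_names <- function(sample_names, pheno) {
  if (length(sample_names) == 0L)
    return(TRUE)
  if (length(sample_names) != nrow(pheno))
    return("sampleNames length differs from the number of samples")
  if (!identical(sample_names, rownames(pheno)))
    return("sampleNames differ from the phenoData row names or their order")
  TRUE
}

pheno <- data.frame(age = c(31, 45, 52), row.names = c("s1", "s2", "s3"))
check_sample_names(c("s1", "s2", "s3"), pheno)  # same names, same order
check_sample_names(c("s2", "s1", "s3"), pheno)  # same names, wrong order
check_sample_names(character(0), pheno)         # no names given
```

Returning TRUE or a message string follows the convention used by S4 validity functions, so the same helper could be called from each place where names are assigned.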
4) I think the class of reporterInfo (data.frameOrNULL) is a bit too strict. You give a compelling reason that we might want to give a control/active factor. Now, since the number of reporters is huge, this slot will (if not empty) be a very big structure, so I think we really want to allow a very specific usage of this kind of slot (data.frames are not terribly efficient). I would like the option of having it be either a factor, an integer or a matrix. A possible use scenario (which I strongly advocate) would be the use of an integer to indicate (x,y) position on the chip for AffyBatch-like objects (right now the map between row and (x,y) position in the AffyBatch object is implicit, which does not allow for subsetting of the object, since that would break the link).
I don't see the inefficiencies you are mentioning. A data.frame is merely a list of vectors, and since I don't think we will solve all problems with a single vector of reporterInfo, a data.frame is the natural data structure. If you have some other data indicating specific inefficiencies please provide it. Your example, and others, are what we had in mind.
Also, if someone wants to do splitting of the assayData based on a factor, it may be _way_ more efficient to have the split done once and for all (I imagine assayDataControl, assayDataActive) (something which btw is not really doable in the current setup since the two structures would have different dimensions), instead of using a factor to do the split "every time". Hmm. I haven't really thought this through.
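As an illustration of the "list of vectors" point, here is a small sketch of a reporter annotation holding both a control/active factor and explicit (x, y) chip positions. The column names type, x and y are assumptions made for the example.

```r
## Toy reporter annotation; the column names are illustrative assumptions.
reporterInfo <- data.frame(
  type = factor(c("control", "active", "active", "control")),
  x    = c(1L, 2L, 1L, 2L),
  y    = c(1L, 1L, 2L, 2L)
)

is.list(reporterInfo)        # a data.frame is a list of column vectors
sapply(reporterInfo, class)  # each column keeps its own type

## Subsetting rows carries the (x, y) positions along explicitly.
active <- reporterInfo[reporterInfo$type == "active", ]
active[, c("x", "y")]
```

Because the positions are stored as columns, subsetting rows does not break the link between a reporter and its location on the chip.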
Not sure what you are worried about here, but we do envisage some general uses of splitting parts, or all of eSets via different variables that are being made available. Again, it is probably best to see what the real usage patterns are before we commit to the implementation.
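One way to read the "split once" idea, sketched in base R with a toy matrix: compute the row indices per factor level a single time and reuse them. The object names are hypothetical, and this is not a proposal for how eSet would store the result.

```r
## Toy assay matrix: 6 reporters (rows) by 2 samples (columns).
exprs <- matrix(1:12, nrow = 6,
                dimnames = list(paste0("r", 1:6), c("s1", "s2")))
type  <- factor(c("control", "active", "active",
                  "control", "active", "active"))

## Compute the row indices for each level once ...
idx <- split(seq_len(nrow(exprs)), type)

## ... then reuse them to pull out blocks of different dimensions.
parts <- lapply(idx, function(i) exprs[i, , drop = FALSE])
sapply(parts, nrow)
```

The resulting blocks have different numbers of rows, which is why they could not sit together in an assayData whose elements must share dimensions.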
5) I am not really in favour of the varMetadata slot of the phenoData class, although the vignette seems to indicate that this was included in Bioc 1.6. The only example you include is the specification of units, something I feel belongs in the varLabels slot, such as "specimen age, in years". As I currently understand it, I feel this is a bit too much annotation. The same goes for a hypothetical reporterMetadata slot. Perhaps you have another usage in mind? There does not seem to be validity checking of this slot?
I don't see how you could ever realistically parse a label and get back what you want (or even know, in some programmatic way, that there is valuable information there); your experience may be different.
6) the assayData slot: I do not really understand the pass-by-reference comments you make in the vignette, but they seem to indicate that there would be performance gains to using an environment. Could you explain this in some more detail? And if there is, I see no reason to allow a list type structure. I think it should be mandatory to have either a list or an environment; allowing both just adds confusion. I would rather have the community choose the most efficient way and then "force" developers to use this.
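A small sketch of the difference between a free-text label and structured variable metadata, using plain data.frames. The column names label and unit are assumptions for illustration and do not describe the actual varMetadata layout.

```r
## Illustrative only; the column names here are assumptions.
pheno <- data.frame(age = c(31, 45), dose = c(0.5, 1.0))

varMetadata <- data.frame(
  label = c("specimen age", "treatment dose"),
  unit  = c("years", "mg"),
  row.names = names(pheno),
  stringsAsFactors = FALSE
)

## Direct lookup by variable name, with no label parsing.
varMetadata["age", "unit"]

## A simple validity-style check: one metadata row per variable.
identical(rownames(varMetadata), names(pheno))
```

With the unit embedded in a label such as "specimen age, in years", the same lookup would need string parsing that depends on how each author phrased the label.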
We try not to force much of anything onto developers. Lists and environments are essentially equivalent here, and there is probably no need to impose one or the other. Users/developers need to store things together and to access them by name - lists and environments both provide that capability. If you, or someone else, wants to do some careful time and space comparisons, we would certainly take that under advisement, but for now, we think we have the resources to get this new data structure in place for the next release.
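For readers unsure what the pass-by-reference remark refers to, this sketch shows the one behavioural difference that matters here: a function that modifies a list works on a copy, while the same function modifies an environment in place. The helper set_flag is hypothetical.

```r
## A function that tries to modify its argument.
set_flag <- function(store) {
  store$flag <- TRUE
  invisible(NULL)
}

as_list <- list(exprs = matrix(0, 2, 2))
as_env  <- new.env()
assign("exprs", matrix(0, 2, 2), envir = as_env)

set_flag(as_list)  # changes a local copy only
set_flag(as_env)   # changes the caller's environment

is.null(as_list$flag)                             # list is untouched
exists("flag", envir = as_env, inherits = FALSE)  # environment was updated

## Both give access by name.
dim(as_list[["exprs"]])
dim(get("exprs", envir = as_env))
```

Whether this translates into a measurable time or space gain for large assay matrices is exactly the comparison that is invited above; the sketch makes no claim about performance.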
7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (let us say a two-color
microarray experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
setClass("twocolor", representation("eSet"),
         validity = function(object) {
           ## this check might be unnecessary
           if (!validObject(as(object, "eSet")))
             return(FALSE)
           ## compare the whole name vector at once
           if (!identical(sort(names(assayData(object))), c("green", "red")))
             return(FALSE)
           else
             return(TRUE)
         })
or how do users actually make sure that the elements of the assayData
have the relevant names (and numbers)?
That would be one use. Martin already pointed out one set of
problems; let me suggest that the need to sort seems wrong, as does the
notion that only red and green are valid names (%in%, toupper, and a
few other functions might make any user of such a class much happier).
You probably also want to run the eSet validity checker.
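A hedged sketch of what a more forgiving check along these lines could look like. Since eSet itself is not reproduced here, the example uses a hypothetical stand-in class miniSet with a single list slot; twoColorSet is likewise an invented name.

```r
library(methods)

## 'miniSet' is a hypothetical stand-in for eSet with one list slot.
setClass("miniSet", representation(assayData = "list"))

## Hypothetical subclass: requires RED and GREEN elements, in any
## order and any letter case, and allows extra elements.
setClass("twoColorSet", contains = "miniSet",
         validity = function(object) {
           have <- toupper(names(object@assayData))
           need <- c("RED", "GREEN")
           if (all(need %in% have)) TRUE
           else paste("missing assayData element:", setdiff(need, have))
         })

ok <- new("twoColorSet",
          assayData = list(Red = matrix(0, 2, 2), green = matrix(0, 2, 2)))

## Creating an object without a green element fails validation.
bad <- tryCatch(new("twoColorSet", assayData = list(red = matrix(0, 2, 2))),
                error = function(e) conditionMessage(e))
bad
```

When a class is declared with contains, validObject runs the validity methods of its superclasses before the subclass's own, so the parent check does not have to be called by hand inside the validity function.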
Thanks again for all the comments,
Robert
Kasper
On Sep 2, 2005, at 9:26 AM, Vincent Carey 525-2265 wrote:
We need discussion of the eSet class, which is to take the place of exprSet in the future. eset.Rnw in Biobase/inst/doc has been revised. Please review and discuss. You will need R 2.2 and the latest Biobase to build this vignette. vc
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 981029-1024
206-667-7700
rgentlem at fhcrc.org