We need discussion of the eSet class, which is to take the place of exprSet in the future. eset.Rnw in Biobase/inst/doc has been revised. Please review and discuss. You will need R 2.2 and the latest Biobase to build this vignette. vc
[Bioc-devel] eset.Rnw revised in Biobase, please review
6 messages · Vincent Carey, Kasper Daniel Hansen, Martin Maechler +2 more
3 days later
Hi Vince and others
Below are my first thoughts about the eSet class. I must say that I
like small "tight" classes with strong validity checking.
I will start with some specific comments:
1) The history slot: a reasonable idea. But if we have a specific
history slot, shouldn't it be filled automatically every time an eSet
is created or modified? That is, every replacement function or
initialization should update this slot. Otherwise I do not really see
the need to keep this slot separate from the notes.
2) The dim method: since it is part of your validity checking that
every component of the assayData slot has the same dimensions, there
is no need to have the dim be a matrix (every column will by
definition be the same). You need an internal method to extract the
matrix of dimensions, in order to do the validity checking of course...
3) I like the idea of having reporterNames separate from the assayData.
That also means that the names do not need to be unique. But should
sampleNames be a separate slot or just be the rownames of the
phenoData slot? There should be some kind of check that the length
of these names is either 0 (no names given) or equal to the number of
samples/reporters.
4) I think the class of reporterInfo (data.frameOrNULL) is a bit too
strict. You give a compelling reason that we might want to give a
control/active factor. Now, since the number of reporters is huge,
this slot will (if not empty) be a very big structure, so I think we
really want to allow a very specific usage of this kind of slot
(data.frames are not terribly efficient). I would like the option of
having it be either a factor, an integer or a matrix. A possible use
scenario (which I strongly advocate) would be the use of an integer
to indicate (x,y) position on the chip for AffyBatch-like objects
(right now the map between row and (x,y) position in the AffyBatch
object is implicit which does not allow for subsetting of the object,
since that would break the link).
Also, if someone wants to do splitting of the assayData based on a
factor, it may be _way_ more efficient to have the split done once
and for all (I imagine assayDataControl, assayDataActive) (something
which btw is not really doable in the current setup since the two
structures would have different dimensions), instead of using a
factor to do the split "every time". Hmm. I haven't really thought this
through.
5) I am not really in favour of the varMetadata slot of the phenoData
class, although the vignette seems to indicate that this was included
in Bioc 1.6. The only example you include is the specification of
units, something I feel belongs in the varLabels slot such as
"specimen age, in years". As I currently understand it, I feel this
is a bit too much annotation. The same goes for a hypothetical
reporterMetadata slot. Perhaps you have another usage in mind? There
does not seem to be validity checking of this slot?
6) the assayData slot: I do not really understand the pass-by-
reference comments you make in the vignette, but they seem to
indicate that there would be performance gains to using an
environment. Could you explain this in some more detail? And if there
is, I see no reason to allow a list type structure. I think it should
be mandatory to have either a list or an environment, allowing both
just adds confusion. I would rather have the community choose the
most efficient way and then "force" developers to use this.
7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (let us say a two-
color micro array experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
setClass("twoclor", representation("eSet"),
validity = function(object){
if(!validObject(as(object, "eSet")
return(FALSE) ## this might be unnecessary
if(sort(names(assayData(object)) != c("green", "red"))
return(FALSE)
else
return(TRUE)
})
or how do users actually make sure that the elements of the assayData
have the relevant names (and numbers)?
Kasper
On Sep 2, 2005, at 9:26 AM, Vincent Carey 525-2265 wrote:
We need discussion of the eSet class, which is to take the place of exprSet in the future. eset.Rnw in Biobase/inst/doc has been revised. Please review and discuss. You will need R 2.2 and the latest Biobase to build this vignette. vc
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hi Vince and others Below are my first thoughts about the eSet class. I must say that I like small "tight" classes with strong validity checking. I will start with some specific comments: 1) The history slot: a reasonable idea. But if we have a specific history slot, shouldn't it be filled automatically every time an eSet is created or modified? That is, every replacement function or initialization should update this slot. Otherwise I do not really see the need to keep this slot separate from the notes.
RG is working on the history concept now so I will pass on this.
2) The dim method: since it is part of your validity checking that every component of the assayData slot has the same dimensions, there is no need to have the dim be a matrix (every column will by definition be the same). You need an internal method to extract the matrix of dimensions, in order to do the validity checking of course...
good point. i am hoping to hear from folks whether they can imagine situations in which the assayData components may have different dimensions. in that case the validity check would have to be relaxed.
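The check under discussion, that every assayData component shares one set of dimensions, could look roughly like the sketch below. The helper names assay_dims and same_dims are hypothetical (they are not Biobase functions), and assayData is assumed here to be a plain named list of matrices.

```r
# Hypothetical helpers; assayData is assumed to be a named list of matrices.

# Collect dimensions: one column per assayData element, rows = (nrow, ncol).
assay_dims <- function(assayData) {
  vapply(assayData, dim, integer(2))
}

# TRUE if every element has the same dimensions as the first one.
same_dims <- function(assayData) {
  if (length(assayData) == 0) return(TRUE)
  d <- assay_dims(assayData)
  all(d == d[, 1])
}

ad <- list(green = matrix(0, nrow = 3, ncol = 2),
           red   = matrix(1, nrow = 3, ncol = 2))
same_dims(ad)

# Adding a component with a different number of rows breaks the invariant.
ad$extra <- matrix(0, nrow = 4, ncol = 2)
same_dims(ad)
```

If the validity check were relaxed to allow differing dimensions, the matrix returned by assay_dims would be the informative answer; if not, a single pair of numbers would do.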
3) I like the idea of having reporterNames separate from the assayData. That also means that the names do not need to be unique. But should sampleNames be a separate slot or just be the rownames of the phenoData slot? There should be some kind of check that the length of these names is either 0 (no names given) or equal to the number of samples/reporters.
i have vacillated on this aspect of metadata. currently i believe that rownames and colnames should be supplied and that the reporterNames must come from there. we now have the reporterData data.frame in there (in annotatedDataset) that can ameliorate the problem of requiring unique reporterNames
4) I think the class of reporterInfo (data.frameOrNULL) is a bit too strict. You give a compelling reason that we might want to give a control/active factor. Now, since the number of reporters is huge, this slot will (if not empty) be a very big structure, so I think we really want to allow a very specific usage of this kind of slot (data.frames are not terribly efficient). I would like the option of having it be either a factor, an integer or a matrix. A possible use scenario (which I strongly advocate) would be the use of an integer to indicate (x,y) position on the chip for AffyBatch-like objects (right now the map between row and (x,y) position in the AffyBatch object is implicit which does not allow for subsetting of the object, since that would break the link).
is the data.frame as a container of a factor really an efficiency loss?
Also, if someone wants to do splitting of the assayData based on a factor, it may be _way_ more efficient to have the split done once and for all (I imagine assayDataControl, assayDataActive) (something which btw is not really doable in the current setup since the two structures would have different dimensions), instead of using a factor to do the split "every time". Hmm. I haven't really thought this through.
we do need to think through the split use cases. example, we would like to make it easy for people to compute a normalization function based strictly on control spots.
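The control-spot use case mentioned here can be illustrated without any pre-split storage. The sketch below uses toy data and a hypothetical control/active factor; the median-offset normalization is only a stand-in for whatever normalization function one would really compute from the controls.

```r
# Toy data: 4 reporters (rows) x 2 samples (columns).
exprs <- matrix(c(10, 50, 12, 70,
                  20, 90, 22, 60),
                nrow = 4,
                dimnames = list(c("r1", "r2", "r3", "r4"), c("s1", "s2")))

# Hypothetical reporter annotation: which rows are control spots.
type <- factor(c("control", "active", "control", "active"))

# Per-sample offset computed from the control rows only.
control_rows <- exprs[type == "control", , drop = FALSE]
offsets <- apply(control_rows, 2, median)

# Subtract each sample's offset from every reporter in that sample.
normalized <- sweep(exprs, 2, offsets)
normalized
```

Here the factor is applied once to pick out the control rows; whether repeating that subsetting is expensive enough to justify separate assayDataControl/assayDataActive structures is exactly the open question.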
5) I am not really in favour of the varMetadata slot of the phenoData class, although the vignette seems to indicate that this was included in Bioc 1.6. The only example you include is the specification of units, something I feel belongs in the varLabels slot such as "specimen age, in years". As I currently understand it, I feel this is a bit too much annotation. The same goes for a hypothetical reporterMetadata slot. Perhaps you have another usage in mind? There does not seem to be validity checking of this slot?
right, no validity checking yet. you are right that such metadata could be contained in labels, but how do you compute on those labels? if you have a few datasets and need to make years and months variables compatible, a convention on a units method may be helpful. we have one vote (private, a long time ago) in favor of the varMetadata approach and now one against.
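To make the "compute on units" argument concrete, here is a minimal sketch assuming a per-variable metadata table with a unit column. The names pheno_a, meta_a, in_years and the column layout are all hypothetical; they are not the varMetadata structure in Biobase, only an illustration of why machine-readable units can be easier to work with than free-text labels.

```r
# Two hypothetical datasets recording age in different units.
pheno_a <- data.frame(age = c(24, 36, 60))
meta_a  <- data.frame(variable = "age", unit = "months",
                      stringsAsFactors = FALSE)

pheno_b <- data.frame(age = c(2, 5))
meta_b  <- data.frame(variable = "age", unit = "years",
                      stringsAsFactors = FALSE)

# Convert a variable to years by looking up its recorded unit.
in_years <- function(pheno, meta, variable) {
  unit <- meta$unit[meta$variable == variable]
  switch(unit,
         years  = pheno[[variable]],
         months = pheno[[variable]] / 12,
         stop("unknown unit: ", unit))
}

# Ages from both datasets, now on a common scale.
ages <- c(in_years(pheno_a, meta_a, "age"),
          in_years(pheno_b, meta_b, "age"))
ages
```

Doing the same from a label such as "specimen age, in years" would require parsing free text, which is the difficulty raised in the reply.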
6) the assayData slot: I do not really understand the pass-by- reference comments you make in the vignette, but they seem to indicate that there would be performance gains to using an environment. Could you explain this in some more detail. And if there is, I see no reason to allow a list type structure. I think it should be mandatory to have either a list or an environment, allowing both just adds confusion. I would rather have the community choose the most efficient way and then "force" developers to use this.
environments are not copied when passed to functions. everything else is, afaik. why not require environments? it is open for additional discussion
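The semantic difference between the two containers can be seen in a small base-R sketch. Note that it demonstrates reference versus value semantics only: R lists are copy-on-modify, so this says nothing by itself about how much copying actually happens when a list is merely passed to a function. The function name set_first is hypothetical.

```r
# The same modifying function applied to a list and to an environment.
set_first <- function(container) {
  container$exprs[1] <- 99
  invisible(NULL)
}

as_list <- list(exprs = c(1, 2, 3))
as_env  <- new.env()
as_env$exprs <- c(1, 2, 3)

set_first(as_list)  # changes a local copy only
set_first(as_env)   # changes the caller's environment in place

as_list$exprs[1]    # unchanged in the caller
as_env$exprs[1]     # changed in the caller
```

With an environment, a function can alter assay data visibly to every holder of the object, which is the trade-off to weigh against any savings in copying.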
7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (let us say a two-
color micro array experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
setClass("twoclor", representation("eSet"),
validity = function(object){
if(!validObject(as(object, "eSet")
return(FALSE) ## this might be unnecessary
if(sort(names(assayData(object)) != c("green", "red"))
return(FALSE)
else
return(TRUE)
})
or how do users actually make sure that the elements of the assayData
have the relevant names (and numbers)?
conceptually i think this is right. we want to make sure the basic infrastructure is not missing anything that you would want to have in COMMON to all the different extensions that one can anticipate for high throughput platforms.
"Kasper" == Kasper Daniel Hansen <khansen at stat.berkeley.edu>
on Mon, 5 Sep 2005 15:45:30 -0700 writes:
.................
Kasper> 7) So the assayData slot does not have a specific
Kasper> number/names for its components. I see the need for
Kasper> this. But let us say I want to use it for a specific
Kasper> case where I have two assays (let us say a two-
Kasper> color micro array experiment). Do you imagine that
Kasper> people will create more specific versions of the
Kasper> class by something like (code not tested)
(yes, it was missing 3 closing ")"
--- quickly seen when using Emacs with "paren match" activated)
>> setClass("twoclor", representation("eSet"),
>> validity = function(object){
>> if(!validObject(as(object, "eSet")))
>> return(FALSE) ## this might be unnecessary
>> if(sort(names(assayData(object)) != c("green", "red")))
>> return(FALSE)
>> else
>> return(TRUE)
>> })
I want to comment on the above code, since
I think I've seen the same mistake several times in people's code:
Validity checking should **NOT** return TRUE or FALSE,
but TRUE or <reason for non-validity> .
This has been in `The Green Book' but also the very first entry in
?setValidity
or ?validObject :
Description:
The validity of 'object' related to its class definition is
tested. If the object is valid, 'TRUE' is returned; otherwise,
either a vector of strings describing validity failures is
returned, or an error is generated (according to whether 'test' is
'TRUE').
The function 'setValidity' sets the validity method of a class
(but more normally, this method will be supplied as the 'validity'
argument to 'setClass'). The method should be a function of one
object that returns 'TRUE' or a description of the non-validity.
Martin
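A validity method following this advice might be sketched as below. Since the eSet definition was still in flux, the example uses a hypothetical stand-in class, toySet, with only an assayData list slot; the class names and the error wording are assumptions for illustration.

```r
library(methods)

# Hypothetical stand-in for eSet with only an assayData slot (a list).
setClass("toySet", representation(assayData = "list"))

setClass("twoColorSet", contains = "toySet",
         validity = function(object) {
           required <- c("green", "red")
           absent <- setdiff(required, names(object@assayData))
           if (length(absent) > 0) {
             # Return a description of the problem, not FALSE.
             return(paste("assayData lacks element(s):",
                          paste(absent, collapse = ", ")))
           }
           TRUE
         })

m <- matrix(0, nrow = 2, ncol = 2)
ok <- new("twoColorSet", assayData = list(green = m, red = m))

# An invalid object makes new() signal an error carrying the description.
msg <- tryCatch(new("twoColorSet", assayData = list(green = m)),
                error = function(e) conditionMessage(e))
msg
```

validObject checks the validity of superclasses before calling the class's own method, so the explicit call on as(object, "eSet") in the original draft should not be needed. Using setdiff also avoids the sort() comparison.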
Hi Kasper,
Kasper Daniel Hansen wrote:
Hi Vince and others Below are my first thoughts about the eSet class. I must say that I like small "tight" classes with strong validity checking. I will start with some specific comments: 1) The history slot: a reasonable idea. But if we have a specific history slot, shouldn't it be filled automatically every time an eSet is created or modified? That is, every replacement function or initialization should update this slot. Otherwise I do not really see the need to keep this slot separate from the notes.
I doubt that such a comprehensive approach will be useful, especially since we do not yet have a markup, or intended mechanism for display or managing the history mechanism. I suspect that at least initially less is going to be more helpful. Perhaps tracking changes to the expressions, or a few other slots would be a good first cut.
2) The dim method: since it is part of your validity checking that every component of the assayData slot has the same dimensions, there is no need to have the dim be a matrix (every column will by definition be the same). You need an internal method to extract the matrix of dimensions, in order to do the validity checking of course...
Vince answered this - we are not yet sure that they would be, and would appreciate examples where they are not.
3) I like the idea of having reporterNames separate from the assayData. That also means that the names do not need to be unique. But should sampleNames be a separate slot or just be the rownames of the phenoData slot? There should be some kind of check that the length of these names is either 0 (no names given) or equal to the number of samples/reporters.
I think that these should be checked in many different ways. Any place that they can be assigned they should be scrutinized and if present we should check that they are the same, and in the same order as those in the phenoData (whether row names on the dataframe or in a special slot).
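A check of the kind described, same names and same order as the phenoData, could be sketched as follows. The function check_sample_names is hypothetical, and it assumes sample names live in the column names of an expression matrix and in the row names of a phenotype data.frame.

```r
# Returns TRUE, or a character description of the mismatch.
check_sample_names <- function(exprs, pheno) {
  if (is.null(colnames(exprs))) return(TRUE)  # no names given
  if (ncol(exprs) != nrow(pheno))
    return("number of samples differs between exprs and phenoData")
  if (!identical(colnames(exprs), rownames(pheno)))
    return("sample names differ from, or are ordered differently than, phenoData row names")
  TRUE
}

exprs <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("s1", "s2")))
pheno <- data.frame(age = c(30, 40), row.names = c("s1", "s2"))

check_sample_names(exprs, pheno)

# Reordering the phenotype rows is detected.
check_sample_names(exprs, pheno[2:1, , drop = FALSE])
```

Because it returns TRUE or a message, the same function could be called from a validity method and from each replacement function that assigns names.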
4) I think the class of reporterInfo (data.frameOrNULL) is a bit too strict. You give a compelling reason that we might want to give a control/active factor. Now, since the number of reporters is huge, this slot will (if not empty) be a very big structure, so I think we really want to allow a very specific usage of this kind of slot (data.frames are not terribly efficient). I would like the option of having it be either a factor, an integer or a matrix. A possible use scenario (which I strongly advocate) would be the use of an integer to indicate (x,y) position on the chip for AffyBatch-like objects (right now the map between row and (x,y) position in the AffyBatch object is implicit which does not allow for subsetting of the object, since that would break the link).
I don't see the inefficiencies you are mentioning? A data.frame is merely a list of vectors, and since I don't think we will solve all problems with a single vector of reporterInfo, data.frame is the natural data structure. If you have some other data indicating specific inefficiencies please provide it. Your example, and others, are what we had in mind.
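The point that a data.frame is a list of vectors, and the (x,y) use case from the quoted message, can both be seen in a short sketch. The column names x, y and type are hypothetical; no claim is made here about relative speed or memory use.

```r
# Hypothetical reporter annotation with explicit chip coordinates.
reporterInfo <- data.frame(
  x    = c(1L, 2L, 1L, 2L),
  y    = c(1L, 1L, 2L, 2L),
  type = factor(c("control", "active", "active", "control"))
)

is.list(reporterInfo)       # a data.frame is a list of column vectors
is.integer(reporterInfo$x)  # each column keeps its own storage type

# Subsetting rows carries the coordinates along with the reporters,
# so the link between reporter and chip position survives.
active <- reporterInfo[reporterInfo$type == "active", ]
active
```

An implicit row-to-position map would be lost by the same subsetting step, which is the argument for storing the coordinates explicitly, in whichever container is chosen.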
Also, if someone wants to do splitting of the assayData based on a factor, it may be _way_ more efficient to have the split done once and for all (I imagine assayDataControl, assayDataActive) (something which btw is not really doable in the current setup since the two structures would have different dimensions), instead of using a factor to do the split "every time". Hmm. I haven't really thought this through.
Not sure what you are worried about here, but we do envisage some general uses of splitting parts, or all of eSets via different variables that are being made available. Again, it is probably best to see what the real usage patterns are before we commit to the implementation.
5) I am not really in favour of the varMetadata slot of the phenoData class, although the vignette seems to indicate that this was included in Bioc 1.6. The only example you include is the specification of units, something I feel belongs in the varLabels slot such as "specimen age, in years". As I currently understand it, I feel this is a bit too much annotation. The same goes for a hypothetical reporterMetadata slot. Perhaps you have another usage in mind? There does not seem to be validity checking of this slot?
I don't see how you could ever realistically parse a label and get back what you want (or even know, in some programmatic way, that there is valuable information there); your experience may be different.
6) the assayData slot: I do not really understand the pass-by- reference comments you make in the vignette, but they seem to indicate that there would be performance gains to using an environment. Could you explain this in some more detail. And if there is, I see no reason to allow a list type structure. I think it should be mandatory to have either a list or an environment, allowing both just adds confusion. I would rather have the community choose the most efficient way and then "force" developers to use this.
We try not to force much of anything onto developers. Lists and environments are essentially equivalent here, and there is probably no need to impose one or the other. Users/developers need to store things together and to access them by name - lists and environments both provide that capability. If you, or someone else, wants to do some careful time and space comparisons, we would certainly take that under advisement, but for now, we think we have the resources to get this new data structure in place for the next release.
7) So the assayData slot does not have a specific number/names for
its components. I see the need for this. But let us say I want to use
it for a specific case where I have two assays (let us say a two-
color micro array experiment). Do you imagine that people will create
more specific versions of the class by something like (code not tested)
setClass("twoclor", representation("eSet"),
validity = function(object){
if(!validObject(as(object, "eSet")
return(FALSE) ## this might be unnecessary
if(sort(names(assayData(object)) != c("green", "red"))
return(FALSE)
else
return(TRUE)
})
or how do users actually make sure that the elements of the assayData
have the relevant names (and numbers)?
That would be one use, Martin already pointed out one set of
problems, let me suggest that the need to sort seems wrong, as does the
notion that only red and green are valid names ( %in%, toupper, and a
few other functions might make any user of such a class much happier).
You probably also want to run the eSet validity checker.
Thanks again for all the comments,
Robert
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
Hi,
I think the history idea is extremely important (and useful). We analyze a lot of different microarray experiments. As a consequence, we often find ourselves in the position of having to go back and figure out exactly how a particular data set was analyzed when we are writing the methods section for the resulting article or grant proposal. Our experience is that the scripts that supposedly performed the analysis don't always match the objects that were saved, and so we have started implementing a comprehensive history mechanism that does update the history slot every time the object is created or modified. You can take a look at the implementation at http://bioinformatics.mdanderson.org/Software/OOMPA
To get this to work, however, we've had to turn processing functions into objects that know how to update the history slot. The benefit is that you can just ask the object what its history is (or even put it into the summary method). We're in the process of converting the various pieces of "expresso" into "Processor" and "Pipeline" objects so we can handle Affymetrix arrays seamlessly, but the code isn't yet in a form that is suitable for public consumption.
Best,
Kevin
--On Tuesday, September 06, 2005 7:19 AM -0700 Robert Gentleman <rgentlem at fhcrc.org> wrote:
Hi Kasper, Kasper Daniel Hansen wrote:
Hi Vince and others Below are my first thoughts about the eSet class. I must say that I like small "tight" classes with strong validity checking. I will start with some specific comments: 1) The history slot: a reasonable idea. But if we have a specific history slot, shouldn't it be filled automatically every time an eSet is created or modified? That is, every replacement function or initialization should update this slot. Otherwise I do not really see the need to keep this slot separate from the notes.
I doubt that such a comprehensive approach will be useful, especially since we do not yet have a markup, or intended mechanism for display or managing the history mechanism. I suspect that at least initially less is going to be more helpful. Perhaps tracking changes to the expressions, or a few other slots would be a good first cut.
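As a point of reference for the history discussion in this thread, the idea of routing every modification through a step that also records itself might be sketched like this. The class trackedData and the function process are hypothetical; this is not the OOMPA Processor/Pipeline implementation, nor the eSet history slot, only a minimal illustration of the pattern.

```r
library(methods)

# Hypothetical class with a data slot and a character history slot.
setClass("trackedData",
         representation(values = "numeric", history = "character"))

# Apply a processing function and append a description of it to the history.
process <- function(object, f, description) {
  object@values  <- f(object@values)
  object@history <- c(object@history, description)
  validObject(object)
  object
}

x <- new("trackedData", values = c(1, 10, 100), history = "created")
x <- process(x, log10, "log10 transform")
x <- process(x, function(v) v - mean(v), "mean-centered")

x@values   # the processed data
x@history  # the steps that produced it, in order
```

Tracking only selected slots, as suggested in the reply above, would amount to calling such a recording step from just the replacement functions for those slots.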