Since we are talking about eSets, I thought I would ask a couple more questions. 1) What is the thinking as to "standard use" for two-color data and eSets? 2) With regard to reporterInfo, is there going to be a constraint to keep the single-column data.frame idea as presented in the vignette, or is that slot meant to be more flexible (left as a generic data.frame)? I ask here because I get the sense that these things are developing, still. If I am wrong, I can certainly redirect to the Bioc list. Thanks, Sean
[Bioc-devel] Couple of eSet questions
10 messages · Sean Davis, Seth Falcon, Vincent Carey +1 more
On 2 Feb 2006, sdavis2 at mail.nih.gov wrote:
I ask here because I get the sense that these things are developing, still. If I am wrong, I can certainly redirect to the Bioc list.
This is a good forum for the discussion.
2) With regard to reporterInfo, is there going to be a constraint to keep the single-column data.frame idea as presented in the vignette, or is that slot meant to be more flexible (left as a generic data.frame)?
I would like to see the reporterInfo be an instance of a class much like the phenoData in that it is data.frame like but also has some description of what is in it. So to me, the phenoData class might be better named AnnotatedDataFrame and then both the phenoData and reporterInfo slots can have one of these. + seth
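The idea of a data.frame that carries a description of its own columns can be sketched with S4 roughly as follows. This is only an illustration under stated assumptions: the class names `AnnotatedDF` and `MiniSet` and the slot name `varMetadata` are made up for this example and are not taken from Biobase.

```r
library(methods)

# Hypothetical class: a data.frame plus one descriptive row per column.
setClass("AnnotatedDF",
         representation(data = "data.frame", varMetadata = "data.frame"),
         validity = function(object) {
           if (!identical(colnames(object@data), rownames(object@varMetadata)))
             return("rownames(varMetadata) must match colnames(data)")
           if (!"description" %in% colnames(object@varMetadata))
             return("varMetadata needs a 'description' column")
           TRUE
         })

# Hypothetical container: both slots hold the same annotated class.
setClass("MiniSet",
         representation(phenoData = "AnnotatedDF", reporterInfo = "AnnotatedDF"))

pheno <- new("AnnotatedDF",
             data = data.frame(sex = c("F", "M"), age = c(34, 51),
                               row.names = c("s1", "s2")),
             varMetadata = data.frame(
               description = c("Sex of patient", "Age in years"),
               row.names = c("sex", "age")))

reporters <- new("AnnotatedDF",
                 data = data.frame(symbol = c("TP53", "BRCA1", "MYC"),
                                   chromosome = c("17", "17", "8"),
                                   row.names = c("r1", "r2", "r3")),
                 varMetadata = data.frame(
                   description = c("Gene symbol", "Chromosome name"),
                   row.names = c("symbol", "chromosome")))

mini <- new("MiniSet", phenoData = pheno, reporterInfo = reporters)
```

With this layout the reporter annotation is no longer limited to a single column, and every column in either slot has to come with a description before the object is accepted.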
More thoughts on eSet...
On 2 Feb 2006, sdavis2 at mail.nih.gov wrote:
Since we are talking about eSets, I thought I would ask a couple more questions.
1) What is the thinking as to "standard use" for two-color data and eSets?
That's a good question. For me it points to a downside of the notion of an "everything set". In its current form, "standard use" for two-color data would require a convention for how to name the assayData elements. I think the notion of grouping together assay data that shares reporterInfo and phenoData is a good one, but I wonder if the actual data storage/access is best handled by technology-specific subclasses.
Here are three usage scenarios:
1. Two-color data. Want to store two expression matrices for red and green scans along with an associated standard error matrix for each.
2. Time-course data. Same samples, same chip. Want to store timepoint expression matrix pairs.
3. Combine 1 & 2.
While I can imagine ways to use eSet for all three, they would all require ad-hoc naming schemes for the elements of the assayData slot. Here, there may be a real advantage to a technology-specific subclass that defines an appropriate structure (redExprs, greenExprs, etc).
Another angle is to search for methods that make sense across different technologies:
description() notes() annotation() history()? phenoData() reporterInfo() sampleNames() reporternames()
2) With regard to reporterInfo, is there going to be a constraint to keep the single-column data.frame idea as presented in the vignette, or is that slot meant to be more flexible (left as a generic data.frame)?
Seems to me that reporterInfo and phenoData slots should contain the same class, a data.frame with some additional meta data. + seth
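The two-color scenario from the message above, expressed as a technology-specific subclass, might look something like the sketch below. The slot names `redExprs` and `greenExprs` come from the message; everything else (`BaseSet`, `TwoColorSet`, `redSE`, `greenSE`, `sampleInfo`, `logRatio`) is a hypothetical name chosen for illustration, not an existing Biobase API.

```r
library(methods)

# Hypothetical virtual base class: only what every technology shares.
setClass("BaseSet", representation("VIRTUAL", sampleInfo = "data.frame"))

# Hypothetical two-color subclass: the structure is visible in the definition.
setClass("TwoColorSet", contains = "BaseSet",
         representation(redExprs = "matrix", greenExprs = "matrix",
                        redSE = "matrix", greenSE = "matrix"),
         validity = function(object) {
           dims <- list(dim(object@redExprs), dim(object@greenExprs),
                        dim(object@redSE), dim(object@greenSE))
           if (length(unique(dims)) != 1L)
             return("all four matrices must have the same dimensions")
           if (ncol(object@redExprs) != nrow(object@sampleInfo))
             return("need one sampleInfo row per matrix column")
           TRUE
         })

# A method that only makes sense for two-color data.
setGeneric("logRatio", function(object) standardGeneric("logRatio"))
setMethod("logRatio", "TwoColorSet",
          function(object) log2(object@redExprs) - log2(object@greenExprs))

# Simulated intensities: 3 reporters by 2 samples.
set.seed(1)
m <- function() matrix(runif(6, 100, 1000), nrow = 3,
                       dimnames = list(paste0("r", 1:3), c("s1", "s2")))

tc <- new("TwoColorSet",
          redExprs = m(), greenExprs = m(), redSE = m(), greenSE = m(),
          sampleInfo = data.frame(dye_swap = c(FALSE, TRUE),
                                  row.names = c("s1", "s2")))
lr <- logRatio(tc)
```

Here no naming convention for list elements is needed: a reader of the class definition sees which matrices exist, and the validity function only has to check that they line up.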
On 2/3/06 1:36 PM, "Seth Falcon" <sfalcon at fhcrc.org> wrote:
More thoughts on eSet... On 2 Feb 2006, sdavis2 at mail.nih.gov wrote:
Since we are talking about eSets, I thought I would ask a couple more questions.
1) What is the thinking as to "standard use" for two-color data and eSets?
That's a good question. For me it points to a downside of the notion of an "everything set". In its current form, "standard use" for two-color data would require a convention for how to name the assayData elements. I think the notion of grouping together assay data that shares reporterInfo and phenoData is a good one, but I wonder if the actual data storage/access is best handled by technology-specific subclasses. Here are three usage scenarios: 1. Two-color data. Want to store two expression matrices for red and green scans along with an associated standard error matrix for each.
What about a "channel" data structure which would include the label used (Cy3, Cy5, etc.), of which there can be 1..2? I still think there needs to be an accepted data matrix and associated error matrix for the whole dataset, though. This would represent the normalized and processed data to be used in any further analysis.
2. Time-course data. Same samples, same chip. Want to store timepoint expression matrix pairs.
Time course data needs to be combined with metadata in order to be useful. I'm not sure that two-color data and time-course data are analogous from that point of view, but I may just be thinking about it slightly differently. Sean
On 3 Feb 2006, sdavis2 at mail.nih.gov wrote:
1. Two-color data. Want to store two expression matrices for red and green scans along with an associated standard error matrix for each.
What about a "channel" data structure which would include the label used (Cy3, Cy5, etc.), of which there can be 1..2?
I was making up the specifics and it shows ;-) Where I think we agree is that technology-specific data structures might be best represented as part of the class structure, not left to an ad-hoc convention for how to use list names.
I still think there needs to be an accepted data matrix and associated error matrix for the whole dataset, though. This would represent the normalized and processed data to be used in any further analysis.
2. Time-course data. Same samples, same chip. Want to store timepoint expression matrix pairs.
Time course data needs to be combined with metadata in order to be useful. I'm not sure that two-color data and time-course data are analogous from that point of view, but I may just be thinking about it slightly differently.
Right, this was supposed to be not-analogous apart from wanting to have a single place to put the metadata. + seth
it sounds to me as if we are finding some use cases, and that we want to extend eSet to cope with those that have well-defined requirements. all we know about assayData at present is that it is either a list or an environment. for twochannel data we may want to define a specific extension of listOrEnv (i think this is possible) that has guaranteed structure and names. for data with standard errors we might have to do likewise. but i would say that part of the intention of the eSet is to require the developer always to allow an environment representation for the assayData. the validity criteria can impose restrictions on this environment. eSet itself does not need to solve the two channel or error-available problems at once. it should be extended to do so, with explicit use cases stated.
---
Vince Carey, PhD Assoc. Prof Med (Biostatistics) Harvard Medical School Channing Laboratory - ph 6175252265 fa 6177311541 181 Longwood Ave Boston MA 02115 USA stvjc at channing.harvard.edu
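The suggestion that validity criteria can impose structure on an environment-backed assayData could be sketched along these lines. The class names `EnvSet` and `TwoChannelEnvSet` and the element names `red` and `green` are assumptions made for this example; it puts the check on the container rather than on an extension of listOrEnv, which is one possible reading of the proposal and not the only one.

```r
library(methods)

# Hypothetical container whose assay data live in an environment.
setClass("EnvSet", representation(assayData = "environment"))

# Hypothetical extension: validity guarantees the names and shapes
# required for two-channel data.
setClass("TwoChannelEnvSet", contains = "EnvSet",
         validity = function(object) {
           required <- c("red", "green")
           missing_el <- setdiff(required, ls(object@assayData))
           if (length(missing_el))
             return(paste("assayData lacks:",
                          paste(missing_el, collapse = ", ")))
           r <- get("red", envir = object@assayData)
           g <- get("green", envir = object@assayData)
           if (!is.matrix(r) || !is.matrix(g) || !identical(dim(r), dim(g)))
             return("'red' and 'green' must be matrices of equal dimension")
           TRUE
         })

# A well-formed environment passes the validity check.
env <- new.env()
assign("red", matrix(1:6, nrow = 3), envir = env)
assign("green", matrix(6:1, nrow = 3), envir = env)
ok <- new("TwoChannelEnvSet", assayData = env)

# An empty environment is rejected when the object is created.
bad <- tryCatch(new("TwoChannelEnvSet", assayData = new.env()),
                error = function(e) conditionMessage(e))
```

Note that the check runs when the object is created (or when validObject() is called); later changes to the environment are not re-validated automatically, which is one practical difference from slots with declared classes.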
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
On 3 Feb 2006, stvjc at channing.harvard.edu wrote:
it sounds to me as if we are finding some use cases, and that we want to extend eSet to cope with those that have well-defined requirements. all we know about assayData at present is that it is either a list or an environment. for twochannel data we may want to define a specific extension of listOrEnv (i think this is possible) that has guaranteed structure and names. for data with standard errors we might have to do likewise. but i would say that part of the intention of the eSet is to require the developer always to allow an environment representation for the assayData. the validity criteria can impose restrictions on this environment.
Respectfully, I think I disagree. I would like to have the use cases drive the design of specific subclasses of an eSet-like class where the structure is expressed as part of the class definition (e.g. exprSet should have an exprs slot, not a named element of an env). Pushing the definition of the structure to the validity function makes the actual structure harder to see (IMO) and I'm concerned that it will make extensions which otherwise would be trivial subclasses, tricky. I guess a part of my objection is that it feels as though we will be implementing our own mini class system where slots are the named elements of an env. A compelling argument for forcing the actual data to be in an environment is to avoid copying.
eSet itself does not need to solve the two channel or error-available problems at once. it should be extended to do so, with explicit use cases stated.
Yes, I'm just not convinced that eSet has any business having actual data slots; those are the domain of its subclasses. Putting aside my perhaps ideological objections, maybe a compromise is to work on some of the concrete subclasses (two-channel data being one good example) and factor out the common elements as they evolve. + seth
On 3 Feb 2006, stvjc at channing.harvard.edu wrote:
it sounds to me as if we are finding some use cases, and that we want to extend eSet to cope with those that have well-defined requirements. all we know about assayData at present is that it is either a list or an environment. for twochannel data we may want to define a specific extension of listOrEnv (i think this is possible) that has guaranteed structure and names. for data with standard errors we might have to do likewise. but i would say that part of the intention of the eSet is to require the developer always to allow an environment representation for the assayData. the validity criteria can impose restrictions on this environment.
Respectfully, I think I disagree. I would like to have the use cases drive the design of specific subclasses of an eSet-like class where the structure is expressed as part of the class definition (e.g. exprSet should have an exprs slot, not a named element of an env).
I think the use case should be used to describe what behaviors we want and then the representation can be chosen to allow those behaviors. The internal representation should be subordinate to the methods that expose the structure through its behaviors. Now perhaps my comment on requiring the developer to allow for an environment is inconsistent with this position.
Pushing the definition of the structure to the validity function makes the actual structure harder to see (IMO) and I'm concerned that it will make extensions which otherwise would be trivial subclasses, tricky. I guess a part of my objection is that it feels as though we will be implementing our own mini class system where slots are the named elements of an env.
I am open to discussion of how the visibility of a structure needs to be cared for in our project. If it is only visible through methods, it shouldn't matter whether the information components are slots or environment elements. And the class designers should be free to change the internal representations without downstream consequences. We have not achieved this in several domains -- should we be trying harder?
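The point that a slot and an environment element are interchangeable as long as access goes through methods can be illustrated with a small sketch. All names here (`getExprs`, `SlotSet`, `EnvBackedSet`, `sampleMeans`) are hypothetical and defined only for this example.

```r
library(methods)

# One accessor generic; downstream code uses nothing else.
setGeneric("getExprs", function(object) standardGeneric("getExprs"))

# Hypothetical representation 1: the matrix is an ordinary slot.
setClass("SlotSet", representation(exprs = "matrix"))
setMethod("getExprs", "SlotSet", function(object) object@exprs)

# Hypothetical representation 2: the matrix is a named element
# of an environment.
setClass("EnvBackedSet", representation(assayData = "environment"))
setMethod("getExprs", "EnvBackedSet",
          function(object) get("exprs", envir = object@assayData))

# Downstream code does not know which representation it was given.
sampleMeans <- function(x) colMeans(getExprs(x))

m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
            dimnames = list(NULL, c("s1", "s2")))
e <- new.env()
assign("exprs", m, envir = e)

a <- sampleMeans(new("SlotSet", exprs = m))
b <- sampleMeans(new("EnvBackedSet", assayData = e))
same <- identical(a, b)
```

Both calls return the same per-sample means, so a class designer could switch between the two representations without changing `sampleMeans`, provided callers never reach for `@exprs` or the environment directly.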
A compelling argument for forcing the actual data to be in an environment is to avoid copying.
That's the intention -- but not to force people to use environments, but to allow them to do so when it makes sense to do so.
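The copying argument rests on the difference between R's usual copy-on-modify behaviour for lists and the reference behaviour of environments. A minimal base-R sketch (the function name `setFirst` is made up for this example):

```r
# The same function body is applied to a list and to an environment.
setFirst <- function(store) {
  store$exprs[1, 1] <- -1   # modify the first cell of the stored matrix
  invisible(NULL)
}

lst <- list(exprs = matrix(0, 2, 2))
env <- new.env()
env$exprs <- matrix(0, 2, 2)

setFirst(lst)   # works on a local copy; the caller's list is untouched
setFirst(env)   # works on the shared environment; the caller sees the change

list_value <- lst$exprs[1, 1]
env_value <- env$exprs[1, 1]
```

After the two calls `list_value` is still 0 while `env_value` is -1. The environment is passed by reference, so the container object is not duplicated when it is handed to a function, but the same property means changes made inside a function are visible to every holder of that environment.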
eSet itself does not need to solve the two channel or error-available problems at once. it should be extended to do so, with explicit use cases stated.
Yes, I'm just not convinced that eSet has any business having actual data slots; those are the domain of its subclasses.
It would be good to get some agreement on this. I would say that the eSet schematizes high throughput assay data. We hold, to some benefit, that there needs to be an assayData component, and that it has reporterInfo and phenoData by virtue of its extension of annotatedDataset. Less than this and we are not solving the high-throughput problem. It has some other slots that I am not so sure about that seem to be there for continuity with the previous incarnation.
Putting aside my perhaps ideological objections, maybe a compromise is to work on some of the concrete subclasses (two-channel data being one good example) and factor out the common elements as they evolve.
i agree. i don't mean to be very pedantic about the representation/ method access concepts ... i have to run now.
2 days later
On 2/3/06 4:05 PM, "Vincent Carey 525-2265" <stvjc at channing.harvard.edu> wrote:
eSet itself does not need to solve the two channel or error-available problems at once. it should be extended to do so, with explicit use cases stated.
Yes, I'm just not convinced that eSet has any business having actual data slots; those are the domain of its subclasses.
I'm just curious as to when and by whom eSets will be subclassed? Will that be left to individual package developers, or will these subclasses be available in biobase for general consumption? Sean
Sean Davis wrote:
On 2/3/06 4:05 PM, "Vincent Carey 525-2265" <stvjc at channing.harvard.edu> wrote:
eSet itself does not need to solve the two channel or error-available problems at once. it should be extended to do so, with explicit use cases stated.
Yes, I'm just not convinced that eSet has any business having actual data slots; those are the domain of its subclasses.
I'm just curious as to when and by whom eSets will be subclassed? Will that be left to individual package developers, or will these subclasses be available in biobase for general consumption?
Ideally there would be platform/experiment-specific subclasses.
Initially I would expect them all to be different, but over time for
consensus to appear. For example, we would be really supportive of some
move on the arrayCGH front that went for a standardized data format (and
some form of eSet could be used). For flow cytometry the prada package
has already done something along these lines and we make some use of it
in rflowcyt.
I doubt it would all go in Biobase, but rather that there might be an
aCGHBase and a flowBase etc. that would help to accommodate different groups.
Whether that happens depends a lot on those folks. We try pretty hard
not to enforce any particular style but rather to show that there are
benefits from using common formats and to having the data be as
self-describing as possible.
Hopefully, once we make some progress on the microarray experiment
repository more people will see the benefits of this approach, but maybe
not.
best wishes
Robert
Robert Gentleman, PhD Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 rgentlem at fhcrc.org