[Bioc-devel] BioC 2.5: Added scanDates slot to Biobase's eSet class

Hi Jim, Laurent, Kasper, Henrik --

Quoting James MacDonald <jmacdon at med.umich.edu>:
If by phenoData we want to mean 'Any random information that may or
may not be phenotypic in nature', then scan date should certainly go
there. However, it seems to me that up to this time we have been
very careful about what goes where precisely because we didn't want
to stuff random information in odd places.

To me, the idea of having different slots with names like phenoData
and assayData and featureData implies to the end user what sort of
data are in there.

If we are to store non-phenotypic, non-biological data somewhere, I
think it makes sense to have another slot. All the slots we have in
the eSet class right now are for data that are conceptually quite
different from things like 'who ran these chips' or 'what day they
were run' or whatever. So putting this sort of data in with
phenotypic data makes no sense to me at all.

For the second iteration of these ideas, we are aiming for a slot,
say arrayData, that addresses Jim's point, i.e., data about arrays and
not phenotypes. Our current vision is for an abstract base class
ArrayData, and initially a derived class with slot scanDate (it's
difficult to know how literally to interpret this, as Henrik points
out; we really don't want to have reserved column names in an
AnnotatedDataFrame, no matter how mangled) and an AnnotatedDataFrame
slot for less structured data. There would be the expected subset
and accessor functionalities.

As Laurent points out, we're taking a bit of a step down a slippery
slope of additional complexity; we know we want to keep the data as
simple as possible. The motivation for 'promoting' scanDate to a full
slot rather than a name-mangled column in an ADF is that we think that
we can reliably (again, modulo Henrik's observation) incorporate this
at an early stage from the main platforms that end up at ExpressionSet
and friends.

Martin
Jim

Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
I am adding my support to Laurent: I think scanDate is simply another
column in the phenotype info, indeed something I always put in, if I
have it available (well, actually I am usually more interested in prep
date). Putting in a new slot seems counterintuitive to me.

Kasper

On Jun 18, 2009, at 12:07 , Patrick Aboyoun wrote:

Laurent,
The scan dates were singled out originally because we have
encountered data sets at the Hutch that appear to have a scan date
effect and wanted a location to store this information so it can be
included in the analysis. As you mentioned, there are other
variables that could be important as well and shouldn't be ignored.

Given that you have been actively working towards a solution of
managing array metadata, you can help create a design that can be
implemented in the Biobase package. Martin Morgan is currently
leading this effort and we can start a dialog off-list (so as not to
spam the rest of the developers with minutiae) with those who are
interested to hammer out a solution to this problem. I think once
the requirements are formally expressed, we can easily put together
a design that meets the user's needs.

Patrick

Laurent Gautier wrote:
Patrick,

The conceptual distinction you want to make can be seen as
artificial.

When you start introducing "arrayData" as a separate entity, you
will soon have to introduce "samplepreparationData" (what
extraction protocol was used, was there any biopsy, etc.),
"imageAnalysisData" (you know, grid alignment, spot segmentation).
Is it reasonable to add a slot each time? Moreover, those
categories can probably also be broken down into subcategories.
Finally, what makes the scanning date so important?
Wouldn't the version of the software used, or the scanner, or the
scanner settings, or the name of the person who performed the
scanning be of relevance?

One route would be to construct an initial AnnotatedDataFrame and
populate it with whatever you fancy from the raw-data files (scan
date, software, etc.). I have been going that way with my homebrew
infrastructure, and it has so far proved quite expressive.
Reserved words are not necessarily very limiting (if
sufficiently specific, say "array_scan_date" and the associated
varMetaData = "Date when scanning the hybridized microarray"), and
I'd think it better to carefully design and document what happens
when one tries to add another column with the same name rather
than rely on security-through-obscurity with mangled names.

L.

Patrick Aboyoun wrote:
Laurent,
As you mentioned, the existing phenoData infrastructure could be
used to house information like scan dates, scanner model, and
scanning software version, but this information is not
conceptually phenotype data, and adding it to an
AnnotatedDataFrame comes with the limitation of using reserved
words (maybe name mangled like .__ScanDates__?) for column names
in the AnnotatedDataFrame.

The internal discussion we have been having about making this more
general is to add a different slot (candidate name arrayData) to
eSet (and remove the scanDates slot) that would house the type
of information we have been discussing in a combination of
dedicated slots like scanDates and a catch-all AnnotatedDataFrame
slot for less universal data. This design would separate the array
data from the phenotype data, and having dedicated slots for
important information like scan dates would avoid having to manage
special columns in an AnnotatedDataFrame.

As you rightly point out we need to ensure we support a rich suite
of functionality like "[", subset, etc., but this can all be
handled through methods for the eSet class.

Keep in mind that this recent change is just a first step, not a
final design, and with your help and input from the rest of the
BioC developer community, we can ensure we end up with a
sufficiently useful microarray data infrastructure.

Cheers,
Patrick

Quoting Laurent Gautier <laurent at cbs.dtu.dk>:

Patrick,

There are indeed always several ways to address needs, and my comment
is mostly pointing at the fact that creating yet another slot is not
necessary since one can currently store such data in phenoData (in
a column named... say "scan_date").

I would in fact call it overbuilding to add a new (and exclusive)
slot when improving the existing infrastructure could perfectly
answer the needs. So today it's "scanDates", and next could be
"scannerModel", or "scanningSoftwareVersion".

I have been a little unclear (even to myself) in my comment about using
"[", so here are more details. *If* the extract operator was made to
evaluate expressions the way the function subset() does, or in fact if
a subset method was implemented for eSet objects, storing all
information in phenoData would make such things nice:

# silly example: only get the control data scanned in the future:
eset[, scan_date > date() & treatment == "control"]
# same with subset:
subset(eset, , scan_date > date() & treatment == "control")

# a little longer to write
eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
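
The subset()-style evaluation proposed above can be tried on plain R
objects without any change to eSet. The following is only a sketch
under stated assumptions: the data frame pd stands in for pData(eset),
the matrix exprs_mat stands in for the assay data, and select_samples
is a hypothetical helper, not a Biobase function. All column names and
values are invented for illustration.

```r
# Hypothetical sample annotation standing in for pData(eset).
pd <- data.frame(
  scan_date = as.Date(c("2003-08-29", "2004-01-23", "2004-01-23")),
  treatment = c("control", "control", "treated"),
  row.names = c("a.cel", "b.cel", "c.cel"),
  stringsAsFactors = FALSE
)

# Hypothetical expression matrix: one column per sample.
exprs_mat <- matrix(1:6, nrow = 2,
                    dimnames = list(c("g1", "g2"), rownames(pd)))

# Hypothetical helper: evaluate an unquoted condition inside the
# annotation, as subset() does for data frames, and use the logical
# result to pick columns. NA results are treated as "do not keep".
select_samples <- function(mat, annot, cond) {
  keep <- eval(substitute(cond), annot, parent.frame())
  mat[, keep & !is.na(keep), drop = FALSE]
}

cutoff <- as.Date("2004-01-01")
picked <- select_samples(exprs_mat, pd,
                         scan_date > cutoff & treatment == "control")
colnames(picked)
```

The condition is looked up first in the annotation columns and then in
the calling environment, which is why cutoff is found even though it is
not a column of pd.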

If for some reason a distinction between phenoData and
like-phenoData-but-can't-be-the-same is needed, please do consider the
creation of an AnnotatedDataFrame that contains all of them.

L.

Patrick Aboyoun wrote:
Laurent,
We had some immediate need for scan date information and rather
than overbuild a system for managing metadata that we may or
may not need, we opted to start simply and then build up as
appropriate. There have been some internal discussions about
managing other metadata along with scan dates, but nothing else
has bubbled to the top yet. Your thoughts and design can help
speed up this process. The class versioning system in Biobase
supports iterative development and we can make further changes
once we lock a design in place. One editorial comment I have is
that lots of designs are possible for a given need and, for
example, the current class properly subsets the scanDates
information using "[" despite it not being stored in the phenoData
(AnnotatedDataFrame) slot.

Cheers,
Patrick

Quoting Laurent Gautier <laurent at cbs.dtu.dk>:

Hi Patrick,

Storing the scan dates is indeed useful information, and it is nice to
have it offered at the parsing stage.
However, my first comment would be: does it justify a new slot in
eSet?

I have been storing scan dates for quite some time now, but opted for
having them in the phenoData as it made more sense to me, both from an
implementation standpoint and from a practical standpoint (as standard
extraction of an eset-subset on columns with the "[" operator works).

If having something specific for scan dates is really, really wished,
would it make sense to have that by extending AnnotatedDataFrame?

In my opinion, the stage at which the data are extracted (in that
case when parsing the files coming out of the image analysis) should
not dictate where the data are stored.
In fact, it might make for a nice(r) workflow if the function
reading raw array data could return an eSet-inheriting instance and a
phenoData with information such as dates and file names. I am working
on a workflow that is in fact getting much more data from the header (I
suppose that I'd contribute it when I have enough time to wrap it up).

Just a few thoughts,

L.

Patrick Aboyoun wrote:
Dear Bioconductor developers,
The Biocore group has just committed a change to the BioC 2.5
code line (Biobase version 2.5.3) to support the use of
microarray scan dates in statistical analyses by adding a
scanDates slot to Biobase's eSet class. This information can
be retrieved and set using the new scanDates and
scanDates<- functions, respectively. The scanDates slot is
designed to hold a character vector whose length equals the
number of samples, with one character element for each sample.
(See help(scanDates) for more information.)

In this first round of check-ins we have added affy support
for this new slot to functions like ReadAffy, and we will be
working towards adding this information for other microarray
platforms as well.

This change involved bumping the eSet version number from
1.1.0 to 1.2.0 in the Biobase class definition. In order to
minimize the impact of this change, the Biobase methods
support both the current eSet version 1.2.0 and old
1.1.0 serialized objects, so updateObject will not need to be
run on eSet-derived objects prior to use in other functions.
We have also tested and version bumped (and patched where
needed) the following packages that create eSet-derived
classes to minimize any package build issues: ACME, beadarray,
beadarraySNP, cellHTS2, CGHbase, codelink, crlmm,
GeneRegionScan, GGBase, maDB, oligoClasses, ontoTools, puma,
rMAT, SNPchip, and spkTools.

Below is a demonstration of the new functionality. If you
encounter any issues related to this change, please e-mail
this list so the community can monitor the change.

- The Biocore Team

> suppressMessages(library(affy))
> example(ReadAffy)
RdAffy> if(require(affydata)){
RdAffy+      celpath <- system.file("celfiles", package="affydata")
RdAffy+      fns <- list.celfiles(path=celpath,full.names=TRUE)
RdAffy+
RdAffy+      cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
RdAffy+      ##read a binary celfile
RdAffy+      abatch <- ReadAffy(filenames=fns[1])
RdAffy+      ##read a text celfile
RdAffy+      abatch <- ReadAffy(filenames=fns[2])
RdAffy+      ##read all files in that dir
RdAffy+      abatch <- ReadAffy(celfile.path=celpath)
RdAffy+ }
Loading required package: affydata
Reading files:
/Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
/Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
> scanDates(abatch)
         binary.cel            text.cel
"01/23/04 14:30:57" "08/29/03 15:12:30"
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
i386-apple-darwin9.6.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] affydata_1.11.6 affy_1.23.2     Biobase_2.5.3

loaded via a namespace (and not attached):
[1] affyio_1.13.3        preprocessCore_1.7.4 tools_2.10.0
_______________________________________________
Bioc-devel at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

