Patrick
Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
I believe that [1] this is not phenoData and [2] it is critical to
understanding the data set.
The second point says that there should definitely be some place to
store it; the first point suggests that the phenoData slot is not ideal.
One strong argument against including it in the phenoData
slot comes from the situation when replicate assays are performed on
the same sample.
What would make it less "ideal" than usual?
In the R/data.frame paradigm, information is just (partially) repeated
across the rows. The hierarchical relationships (such as some rows
representing a common sample) are modelled with a column (or several
columns) containing that information. It's pretty much like one
table within a relational database.
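A minimal base-R sketch of that layout, assuming made-up column names and values (array_id, sample_id, tissue and scan_date are illustrative, not Biobase conventions): sample-level information is repeated on each replicate row, and a grouping column carries the hierarchy.

```r
# Hypothetical sample annotation: two biological samples, with sample "S1"
# assayed twice (replicate arrays). All names and values are made up.
pheno <- data.frame(
  array_id  = c("A1", "A2", "A3"),
  sample_id = c("S1", "S1", "S2"),   # grouping column: A1 and A2 share a sample
  tissue    = c("liver", "liver", "kidney"),
  scan_date = as.Date(c("2009-06-01", "2009-06-08", "2009-06-01")),
  stringsAsFactors = FALSE
)

# Sample-level information (tissue) is repeated across the replicate rows,
# while array-level information (scan_date) is free to differ.
replicates <- pheno[pheno$sample_id == "S1", ]

# The hierarchy is recovered from the grouping column.
arrays_per_sample <- table(pheno$sample_id)
print(arrays_per_sample)
```

Here the two replicate rows share a tissue but differ in scan date, which is the situation discussed in this thread.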
Moreover, I want to stress (again) that the very notion of what is
phenoData, and what is not, is I think very subjective.
What is a "replicate assay" can depend very much on the nature of the
study. To take just one example:
- when looking for somatic mutations (say with SNP arrays, or CGH),
samples from different tissues/organs are just "replicate"
measures
- when looking at expression pattern changes between tissues, they
no longer are.
Finally, phenoData as meaning "phenotypic data" has probably been misused
for quite some time (anyone storing a known mutation status for example
has been storing "genotypic data", as has anyone storing patient
samples with the hospital / physician / surgeon).
Rather than (suddenly) trying to play the semantic police about what
goes into phenoData, it would be an option to either a) keep the name
and say it's like that for historical reasons or b) adopt a generic
name.
"arrayData" does not sound too good to me as it evokes the actual
spot signal (that goes into AssayData). "covariateData" would be an
option (if only it had one syllable less). The suggested
"experimentData" is a good name (although an abbreviation could be
considered?).
L.
The phenotypic/clinical/demographic data for
repeated samples is the same, but the experimental characteristics
(run date, etc) can be very different. And, as already pointed out by
several people, there are numerous other experimental characteristics
(such as sample preparation date) that could also affect the
interpretation of the results.
So, I would argue in favor of a new slot (probably implemented as yet
another AnnotatedDataFrame) called something like
experimentCharacteristics, in which scanDate would be one commonly
used column.
Kevin
James MacDonald wrote:
If by phenoData we want to mean 'Any random information that may or
may not be phenotypic in nature', then scan date should certainly
go there. However, it seems to me that up to this time we have been
very careful about what goes where precisely because we didn't want
to stuff random information in odd places.
To me, the idea of having different slots with names like phenoData
and assayData and featureData implies to the end user what sort of
data are in there.
If we are to store non-phenotypic, non-biological data somewhere, I
think it makes sense to have another slot. All the slots we have in
the eSet class right now are for data that are conceptually quite
different from things like 'who ran these chips' or 'what day they
were run' or whatever. So putting this sort of data in with
phenotypic data makes no sense to me at all.
Jim
Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
I am adding my support to Laurent: I think scanDate is simply
another column in the phenotype info, indeed something I always
put in, if I have it available (well, actually I am usually more
interested in prep date). Putting in a new slot seems
counterintuitive to me.
Kasper
On Jun 18, 2009, at 12:07 , Patrick Aboyoun wrote:
Laurent, The scan dates were singled out originally because we
have encountered data sets at the Hutch that appear to have a
scan date effect and wanted a location to store this
information so it can be included in the analysis. As you
mentioned, there are other variables that could be important as
well and shouldn't be ignored.
Given that you have been actively working towards a solution of
managing array metadata, you can help create a design that can
be implemented in the Biobase package. Martin Morgan is
currently leading this effort and we can start a dialog
off-list (so as not to spam the rest of the developers with
minutiae) with those who are interested to hammer out a
solution to this problem. I think once the requirements are
formally expressed, we can easily put together a design that
meets the user's needs.
Patrick
Laurent Gautier wrote:
Patrick,
The conceptual distinction you want to make can be seen as
artificial.
When you start introducing "arrayData" as a separate entity,
you will soon have to introduce "samplepreparationData" (what
extraction protocol was used, was there any biopsy,
etc...), "imageAnalysisData" (you know, grid alignment, spot
segmentation). Is it reasonable to add a slot each time?
Moreover, those categories can probably also be broken down
into subcategories. Finally, what is making the scanning date
so important ? Wouldn't the version of the software used, or
the scanner, or the scanner settings, or the name of the
person who performed the scanning be of relevance ?
One route would be to construct an initial AnnotatedDataFrame
and populate it with whatever you fancy from the raw-data
files (scan date, software, etc...). I have been going that way
with my homebrew infrastructure, and it has so far proven
quite expressive. Reserved words are not
necessarily very limiting (if sufficiently specific, say
"array_scan_date" and the associated varMetadata = "Date when
scanning the hybridized microarray"), and I'd think it better to
carefully design and document what happens when one
tries to add another column with the same name rather than
rely on security-through-obscurity with mangled names.
L.
Patrick Aboyoun wrote:
Laurent, As you mentioned, the existing phenoData
infrastructure could be used to house information like scan
dates, scanner model, and scanning software version, but
this information is not conceptually phenotype data,
and adding it to an AnnotatedDataFrame comes with the
limitation of using reserved words (maybe name mangled like
.__ScanDates__?) for column names in the
AnnotatedDataFrame.
The internal discussion we have been having on making this
more general is to add a different slot (candidate name
arrayData) to eSet (and remove the scanDates slot) that
would house the type of information we have been discussing
in a combination of dedicated slots like scanDates and a
catch-all AnnotatedDataFrame slot for less universal data.
This design would separate the array data from the
phenotype data, and having dedicated slots for important
information like scan dates would avoid having to manage special
columns in an AnnotatedDataFrame.
As you rightly point out we need to ensure we support a
rich suite of functionality like "[", subset, etc., but
this can all be handled through methods for the eSet class.
Keep in mind that this recent change is just a first step,
not a final design, and with your help and input from the
rest of the BioC developer community, we can ensure we end
up with a sufficiently useful microarray data
infrastructure.
Cheers, Patrick
Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
Patrick,
There are indeed always several ways to address needs,
and my comment is mostly pointing at the fact that
creating yet-an-other slot is not necessary since one can
currently store such data into phenoData (into a column
named... say "scan_date").
I would in fact describe as overbuilding the approach that
adds a new (and exclusive) slot when improving the
existing infrastructure could perfectly answer the needs.
So today it's "scanDates", and next could be
"scannerModel", or "scanningSoftwareVersion".
I have been a little unclear (even to myself) in my
comment about using "[", so here are more details. *If*
the extract operator was made to evaluate expressions
such as the function subset() does, or in fact if a
method subset was implemented for eSet objects, storing
all information into phenoData makes such things nice:
# silly example: only get the control data scanned in the future:
eset[, scan_date > date() & treatment == "control"]
# same with subset:
subset(eset, , scan_date > date() & treatment == "control")
# a little longer to write
eset[, scanDates(eset) > date() & pData(eset)$treatment == "control"]
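As a runnable base-R sketch of the same idea, a plain data.frame can stand in for pData(eset); the column names follow the example above, while the values and the cutoff date are made up. It only illustrates how subset() evaluates the condition inside the data frame, not an existing eSet method.

```r
# Stand-in for pData(eset): one row per array. Column names follow the
# example above; the values are made up.
pd <- data.frame(
  scan_date = as.Date(c("2009-05-01", "2009-06-10", "2009-06-15")),
  treatment = c("control", "treated", "control"),
  row.names = c("A1", "A2", "A3"),
  stringsAsFactors = FALSE
)
cutoff <- as.Date("2009-06-01")  # hypothetical cutoff date

# subset() evaluates the condition inside the data frame, so columns are
# referred to by name.
late_controls <- subset(pd, scan_date > cutoff & treatment == "control")

# The equivalent with "[" needs the data frame spelled out for each column.
same_rows <- pd[pd$scan_date > cutoff & pd$treatment == "control", ]

# The row names of the result could then be used to index the columns
# (samples) of an eSet.
keep <- rownames(late_controls)
print(keep)
```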
If for some reason a distinction between phenoData and
like-phenoData-but-can't-be-the-same is needed, please do
consider the creation of an AnnotatedDataFrame that
contains all of them.
L.
Patrick Aboyoun wrote:
Laurent, We had some immediate need for scan date
information and rather than overbuild a system for
managing metadata that we may or may not need, we
opted to start simply and then build up as appropriate.
There have been some internal discussions about managing
other metadata along with scan dates, but nothing else
has bubbled to the top yet. Your thoughts and design
can help speed up this process. The class versioning
system in Biobase supports iterative development and
we can make further changes once we lock a design in
place. One editorial comment I have is that lots of
designs are possible for a given need and, for example,
the current class properly subsets the scanDates information
using "[" despite it not being stored in the
phenoData (AnnotatedDataFrame) slot.
Cheers, Patrick
Quoting Laurent Gautier <laurent at cbs.dtu.dk>:
Hi Patrick,
Storing the scan dates is indeed useful,
and it is nice to have them offered at the parsing
stage. However, my first comment would be "does it
justify a new slot" in eSet?
I have been storing scan dates for quite some time
now, but opted for having them in the phenoData as it
made more sense to me, both from an implementation
standpoint and from a practical standpoint (as standard extraction
of an eset-subset on columns with the "["
operator works).
If having something specific for scan dates is really
wished, would it make sense to have that
by extending AnnotatedDataFrame?
In my opinion, the stage at which the data are
extracted (in this case when parsing the files coming
out of the image analysis) should not dictate where
the data are stored. In fact, it might make for a
nice(r) workflow if the function reading raw array
data could return an eSet-inheriting instance and a phenoData
with information such as dates and file
names. I am working on a workflow that is in fact
getting much more data from the header (I suppose
that I'd contribute it when I have enough time to wrap it
up).
Just a few thoughts,
L.
Patrick Aboyoun wrote:
Dear Bioconductor developers, The Biocore group has
just committed a change to the BioC 2.5 code line
(Biobase version 2.5.3) to support the use of microarray scan
dates in statistical analyses by
adding a scanDates slot to Biobase's eSet class.
This information can be retrieved and set using
the new scanDates and scanDates<- functions,
respectively. The scanDates slot is designed to
hold a character vector of length = # of samples,
with one character element for each sample. (See
help(scanDates) for more information.)
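A small base-R sketch of how such character scan dates might be turned into something usable in an analysis. It does not call scanDates itself; the two strings are the ones shown in the demonstration output of this message, and the month/day/two-digit-year format and the UTC time zone are assumptions.

```r
# Scan dates as character strings, in the form shown in the demonstration
# output of this message (assumed to be month/day/two-digit year).
scan_dates <- c(binary.cel = "01/23/04 14:30:57",
                text.cel   = "08/29/03 15:12:30")

# Convert to date-times; the time zone is an assumption, since the strings
# carry none.
scan_times <- as.POSIXct(scan_dates, format = "%m/%d/%y %H:%M:%S", tz = "UTC")
names(scan_times) <- names(scan_dates)

# A date-only factor is one simple way to use scan date as a batch covariate.
scan_batch <- factor(format(scan_times, "%Y-%m-%d"))
print(scan_batch)
```

Whether a date-only batch factor is the right granularity depends on the study; it is shown only as one possible use.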
In this first round of check-ins we have added affy
support of this new slot to functions like
ReadAffy and we will be working towards adding
this information to other microarray platforms as
well.
This change involved bumping the eSet version
number from 1.1.0 to 1.2.0 in the Biobase class
definition. In order to minimize the impact of
this change, the Biobase methods support both the
current eSet version 1.2.0 as well as old 1.1.0
serialized objects, so updateObject does not need to be
run on eSet-derived objects
prior to use in other functions. We have also
tested and version bumped (and patched where
needed) the following packages that create
eSet-derived classes to minimize any package build
issues: ACME, beadarray, beadarraySNP, cellHTS2,
CGHbase, codelink, crlmm, GeneRegionScan, GGBase,
maDB, oligoClasses, ontoTools, puma, rMAT,
SNPchip, and spkTools.
Below is a demonstration of the new functionality.
If you encounter any issues related to this
change, please e-mail this list so the community
can monitor the change.
- The Biocore Team
suppressMessages(library(affy))
example(ReadAffy)
RdAffy> if(require(affydata)){
RdAffy+      celpath <- system.file("celfiles", package="affydata")
RdAffy+      fns <- list.celfiles(path=celpath,full.names=TRUE)
RdAffy+
RdAffy+      cat("Reading files:\n",paste(fns,collapse="\n"),"\n")
RdAffy+      ##read a binary celfile
RdAffy+      abatch <- ReadAffy(filenames=fns[1])
RdAffy+      ##read a text celfile
RdAffy+      abatch <- ReadAffy(filenames=fns[2])
RdAffy+      ##read all files in that dir
RdAffy+      abatch <- ReadAffy(celfile.path=celpath)
RdAffy+ }
Loading required package: affydata
Reading files:
 /Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/binary.cel
/Library/Frameworks/R.framework/Versions/2.10/Resources/library/affydata/celfiles/text.cel
         binary.cel            text.cel
"01/23/04 14:30:57" "08/29/03 15:12:30"
R version 2.10.0 Under development (unstable) (2009-06-12 r48755)
i386-apple-darwin9.6.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] affydata_1.11.6 affy_1.23.2     Biobase_2.5.3
loaded via a namespace (and not attached):
[1] affyio_1.13.3        preprocessCore_1.7.4 tools_2.10.0