2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.morgan at roswellpark.org>:
On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
Hello,
I would like to ask you all for advice on the following issue.
Last year I started working with data from The Cancer Genome Atlas.
During that work our team (https://github.com/orgs/RTCGA/people)
prepared some tools for downloading and integrating datasets from the TCGA
study and provided them in the R package called RTCGA
<https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which is
available on Bioconductor.
Later on we worked on tools for visualizing and analyzing the most
popular datasets from TCGA, so we prepared data packages with those
datasets and submitted them to Bioconductor as 8 separate packages. You can
read more about them here: http://rtcga.github.io/RTCGA/
*I have a question about updating those data packages.* TCGA releases
dataset snapshots over time. The RTCGA family of R packages currently
provides datasets from the 2015-11-01 release, but one can
check that there has been a newer release, 2016-01-28:
tail(RTCGA::checkTCGA('Dates'))
- BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
- BiocDevel: snapshot from 2015-11-01 || package ver 20151101.1.0
- BiocRelease: NOT YET AVAILABLE
- BiocDevel: snapshot from 2015-11-0 || package ver 0.99.4
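The staleness check described above can be sketched in base R. This is only an illustration: the dates are the ones mentioned in this thread, `is_outdated` is a hypothetical helper, and the return format of `RTCGA::checkTCGA('Dates')` is deliberately not assumed.

```r
# Release dates mentioned in this thread; in practice they could be derived
# from RTCGA::checkTCGA('Dates'), whose return format is not assumed here.
available <- as.Date(c("2015-08-21", "2015-11-01", "2016-01-28"))
packaged  <- as.Date("2015-11-01")  # snapshot shipped in the data packages

# Hypothetical helper: TRUE when a snapshot newer than `packaged` exists.
is_outdated <- function(packaged, available) {
  packaged < max(available)
}

is_outdated(packaged, available)

# Snapshots released after the packaged one.
newer <- available[available > packaged]
format(newer)
```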
I think that having datasets from the newest snapshot date is vital for
data analysis, but I wouldn't like to create situations in which 2 separate
analysts use RTCGA.clinical and get different results because they used
different data versions. That's why I have started versioning data packages
with the number that corresponds to the release date.
This isn't very helpful. There is only ever one version of
'RTCGA.clinical' available per Bioc version, so whether its version is
20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
Probably you want to include the TCGA release in the package _name_,
'RTCGA.clinical.20151101'. Probably you want to have multiple versions
available at any one time.
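A minimal sketch of the naming idea, assuming a hypothetical helper `release_pkg_name` (not part of RTCGA or Bioconductor). It also shows that a date-based version number does sort above 1.1.0, which is valid but does not help when only one version is installable per Bioc release.

```r
# Hypothetical helper: embed the TCGA release date in the package name.
release_pkg_name <- function(base, release_date) {
  paste(base, format(as.Date(release_date), "%Y%m%d"), sep = ".")
}

release_pkg_name("RTCGA.clinical", "2015-11-01")

# A date-based version number still compares as an ordinary package version.
package_version("20151101.1.0") > package_version("1.1.0")
```

With release-specific names, several snapshots could be installed side by side, and an analysis script would state which one it uses in its library() call.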
Thanks for the comments. I hadn't considered making separate packages for
separate data releases.
I don't think the experiment data archive is the best solution for
distributing large collections of curated data. It places a burden on our
mirrors to sync the repository and on the svn repository to store it. The
packages are built twice weekly even though the data is very static and in
your case based on unchanging base R data structures. The data are not very
'granular', even though you've done a good job of making the individual
data sets accessible, so a user interested in ovarian cancers, say, would
need to download all data anyway.
Instead I think that these should be ExperimentHub resources. How to add
resources is described in the vignette to the companion package
ExperimentHubData
http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html
The data would be stored in Amazon S3 so globally accessible; it would
not be under version control. The ExperimentHub / AnnotationHub cache would
manage local versions, rather than R's package system.
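From the user's side, retrieving a hub resource might look roughly like the sketch below. It assumes the ExperimentHub package is installed and that network access is available; the function name `find_hub_records` and the search term are illustrative, and no RTCGA records are assumed to exist in the hub.

```r
# Hypothetical wrapper around the ExperimentHub query interface.
# Requires the ExperimentHub package and network access when called.
find_hub_records <- function(term) {
  if (!requireNamespace("ExperimentHub", quietly = TRUE)) {
    stop("The ExperimentHub package is required for this sketch.")
  }
  hub <- ExperimentHub::ExperimentHub()  # downloads or refreshes hub metadata
  AnnotationHub::query(hub, term)        # records whose metadata match `term`
}

# Example call (may return zero records):
# hits <- find_hub_records("RTCGA")
# A single record is then fetched by its identifier with hits[["<id>"]];
# the file is downloaded once and later served from the local cache.
```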
ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing is
fairly good.
Thanks for letting me know. I wasn't aware of such a solution. I'll take
a closer look at ExperimentHub.
I think it is also worthwhile to discuss how you have chosen to
represent each of the data types, for instance the RNAseq data as a samples
x genes data.frame whereas the Bioconductor convention would store it
primarily as a genes x sample matrix embedded in a SummarizedExperiment (or
at least make it available to the user in that form; there are definitely
advantages to keeping the serialized instance as simple as possible).
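The conversion between the two layouts can be sketched as follows. The toy table, its column names and its values are made up for illustration and do not reflect the actual RTCGA.rnaseq layout; the SummarizedExperiment step runs only if that package is installed.

```r
# Toy samples x genes table; sample IDs and values are invented.
expr_df <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  TP53  = c(10, 12, 9),
  BRCA1 = c(5, 7, 6)
)

# Drop the ID column, transpose to genes x samples, and keep the
# sample IDs as column names.
expr_mat <- t(as.matrix(expr_df[, -1]))
colnames(expr_mat) <- expr_df$sample_id
dim(expr_mat)

# Optional: embed the matrix in a SummarizedExperiment when available.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays  = list(expression = expr_mat),
    colData = S4Vectors::DataFrame(
      sample_id = colnames(expr_mat),
      row.names = colnames(expr_mat)
    )
  )
}
```

Keeping the serialized object as a plain data.frame and building the SummarizedExperiment on load is one way to get both the simple storage and the conventional interface.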
Martin Morgan
Bioconductor
What do you think about this issue? You can post advice here or on our
issue list: https://github.com/RTCGA/RTCGA/issues
Thanks for your comments,
Marcin