[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

Tue, Apr 26, 2016 11:35 AM

I have read in the vignette that:

2 Adding resources

Resources are contributed to ExperimentHub in the form of a package. The
package contains the resource metadata, man pages, vignette and any
supporting R functions the author wants to provide. This is a similar
design to the existing Bioconductor experimental data packages except the
data are uploaded to AWS S3 buckets instead of stored in a data/ directory
as part of the package.

New packages should be submitted to the Bioconductor tracker and will have
a full review. Contact packages at bioconductor.org for more information.


So if I'd like to provide datasets from the newest TCGA data snapshot, I
should submit new packages via the Bioconductor tracker, but with a slightly
different package design than an experiment data package.

You said that

*ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing is
fairly good.*

Does it mean I should upload these data packages before May 4th or after?

2016-04-18 20:04 GMT+02:00 Marcin Kosiński <m.p.kosinski at gmail.com>:


2016-04-16 22:55 GMT+02:00 Martin Morgan <martin.morgan at roswellpark.org>:


On 04/16/2016 01:09 PM, Marcin Kosiński wrote:

Hello,

I would like to ask you all for advice on the following issue.

Last year I started working with data from The Cancer Genome Atlas.
During that work our team (https://github.com/orgs/RTCGA/people)
prepared some tools for downloading and integrating datasets from the TCGA
study and provided them in the R package called RTCGA
<https://www.bioconductor.org/packages/3.3/bioc/html/RTCGA.html>, which
is available on Bioconductor.

Later on we worked on tools for visualizing and analyzing the most
popular datasets from TCGA, so we prepared data packages with those
datasets and submitted them to Bioconductor as 8 separate packages. You
can read more about them here: http://rtcga.github.io/RTCGA/

*I have a question about updating those data packages.* TCGA releases
dataset snapshots over time. The RTCGA family of R packages currently
provides datasets from the 2015-11-01 release, but one can check that
there is a newer release, 2016-01-28:

tail(RTCGA::checkTCGA('Dates'))

[1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
"2016-01-28"
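
A minimal base R sketch of how the packaged snapshot could be compared
against the available releases. The dates are copied from the output above;
the `packaged` value is simply the 2015-11-01 snapshot mentioned in this
message, typed in by hand rather than queried from any package.

```r
# Release dates copied from the checkTCGA('Dates') output above.
releases <- as.Date(c("2015-02-04", "2015-04-02", "2015-06-01",
                      "2015-08-21", "2015-11-01", "2016-01-28"))

# Snapshot assumed to be shipped in the devel data packages
# (taken from this message, not read from the packages themselves).
packaged <- as.Date("2015-11-01")

# Most recent release and whether the packaged snapshot lags behind it.
latest <- max(releases)
is_outdated <- packaged < latest

# Releases published after the packaged snapshot.
newer <- releases[releases > packaged]
print(newer)
```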

I am wondering whether we should upload newer datasets to those data
packages. We have found that there are great differences in the results of
data analysis depending on which release date the datasets were taken
from. More about this issue can be found here:
http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata

The current state of the RTCGA family of R packages is listed below:

RTCGA.clinical
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.clinical.html>

   - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
   - BiocDevel: snapshot from 2015-11-01  || package ver 20151101.1.0

RTCGA.rnaseq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.rnaseq.html>

   - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
   - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

RTCGA.mutations
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mutations.html>

   - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
   - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

---------------------------------------------------

RTCGA.methylation
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.methylation.html>

   - BiocRelease: NOT YET AVAILABLE
   - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1


RTCGA.CNV
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.CNV.html>

   - BiocRelease: NOT YET AVAILABLE
   - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5


RTCGA.RPPA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.RPPA.html>

   - BiocRelease: NOT YET AVAILABLE
   - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6


RTCGA.mRNA
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.mRNA.html>

   - BiocRelease: NOT YET AVAILABLE
   - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3


RTCGA.miRNASeq
<http://www.bioconductor.org/packages/3.3/data/experiment/html/RTCGA.miRNASeq.html>

   - BiocRelease: NOT YET AVAILABLE
   - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4


I think that having datasets from the newest snapshot date is vital for
data analysis, but I wouldn't like to create situations in which two
separate analysts use RTCGA.clinical and get different results because they
used different data versions. That's why I have started versioning data
packages with a number that corresponds to the release date.

This isn't very helpful. There is only ever one version of
'RTCGA.clinical' available per Bioc version, so whether its version is
20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.

Probably you want to include the TCGA release in the package _name_,
'RTCGA.clinical.20151101'. Probably you want to have multiple versions
available at any one time.
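
A minimal sketch of the naming idea in base R. The helper
`snapshot_package_name` is hypothetical and not part of RTCGA; it only
appends the TCGA release date to a base package name. The last line
illustrates the point about versions: a date-based version number is, to R,
simply a higher version of the same package.

```r
# Hypothetical helper (not part of RTCGA): build a package name that
# carries the TCGA release date, e.g. "RTCGA.clinical.20151101".
snapshot_package_name <- function(base, release_date) {
  paste0(base, ".", format(as.Date(release_date), "%Y%m%d"))
}

# Two snapshots give two distinct package names that can coexist.
name_2015 <- snapshot_package_name("RTCGA.clinical", "2015-11-01")
name_2016 <- snapshot_package_name("RTCGA.clinical", "2016-01-28")
print(c(name_2015, name_2016))

# A date-based version only orders releases of one and the same package.
date_version_is_newer <- package_version("20151101.1.0") > package_version("1.1.0")
print(date_version_is_newer)
```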

Thanks for comments. I haven't considered making separate packages for
separate data releases.

I don't think the experiment data archive is the best solution for
distributing large collections of curated data. It places a burden on our
mirrors to sync the repository and on the svn repository to store it. The
packages are built twice weekly even though the data is very static and in
your case based on unchanging base R data structures. The data are not very
'granular', even though you've done a good job of making the individual
data sets accessible, so a user interested in ovarian cancers, say, would
need to download all data anyway.

Instead I think that these should be ExperimentHub resources. How to add
resources is described in the vignette to the companion package
ExperimentHubData


http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html

The data would be stored in Amazon S3 so globally accessible; it would
not be under version control. The ExperimentHub / AnnotationHub cache would
manage local versions, rather than R's package system.

ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing is
fairly good.

Thanks for letting me know. I wasn't aware of such a solution. I'll take
a closer look at ExperimentHub.

I think it is also worthwhile to discuss how you have chosen to
represent each of the data types, for instance the RNAseq data as a samples
x genes data.frame whereas the Bioconductor convention would store it
primarily as a genes x sample matrix embedded in a SummarizedExperiment (or
at least make it available to the user in that form; there are definitely
advantages to keeping the serialized instance as simple as possible).
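
To make the layout difference concrete, here is a small base R sketch with
made-up values. The sample identifiers and expression numbers are invented
for illustration, and the column layout is an assumption rather than the
actual structure of the RTCGA data. Wrapping the resulting matrix in a
SummarizedExperiment is left out so the example needs no Bioconductor
packages.

```r
# Toy samples x genes data.frame (identifiers and values are invented).
rnaseq_df <- data.frame(
  sample_id = c("sample_A", "sample_B", "sample_C"),
  TP53 = c(10.5, 8.2, 12.1),
  BRCA1 = c(3.3, 4.8, 2.9),
  stringsAsFactors = FALSE
)

# Drop the identifier column, convert to a numeric matrix, and transpose
# so that rows are genes and columns are samples.
expr <- t(as.matrix(rnaseq_df[, -1]))
colnames(expr) <- rnaseq_df$sample_id

print(dim(expr))  # genes in rows, samples in columns
print(expr)
```

A matrix in this orientation is the kind of object that could then be
supplied as an assay when constructing a SummarizedExperiment.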

I've been informed about Bioconductor structures. There is an additional
function, RTCGA::convertTCGA (in devel), that transposes expression data sets
(rnaseq, miRNASeq, mRNA, methylation, etc.) and embeds them in an ExpressionSet:

https://github.com/RTCGA/RTCGA/blob/master/R/convertTCGA.R#L116-L122

Marcin Kosiński,
RTCGA

Martin Morgan
Bioconductor

What do you think about this issue? You can post advice here or on our
issue list: https://github.com/RTCGA/RTCGA/issues

Thanks for comments,
Marcin


_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
