[Bioc-devel] Data Package Size Issues (.idat and .rda)
In that case, I will try to see if the public databases have the kind of data sets I am trying to package and run the idea by the team that is assigned to the project I am developing. Thank you Martin, Sean and Kasper for your valuable insight!
--- Nicolas De Jay
On Fri, Nov 8, 2013 at 9:07 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
On Fri, Nov 8, 2013 at 8:41 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 11/07/2013 09:26 PM, Nicolas De Jay wrote:
Thanks for the prompt answer. The data set I am packaging closely resembles that of minfiData except that there are 52 samples; the IDAT files together are some 800MB whereas the Rda file is closer to 150MB. It is worth noting that my experiment data package will be submitted to Bioconductor along with a software package which makes use of these samples in the vignette. With this in mind, can I omit the IDAT files? If this goes against Bioconductor's underlying design, what would you say is the maximum size of an experiment data package?
Hi Nicolas -- Some things to bear in mind.
Hi, Nicolas. I just wanted to note that experiment data packages are meant as a convenient way to distribute data so that reproducible workflows and documentation can be created easily. There are other options that serve the same purpose, such as accessing the data directly from public repositories using Bioconductor tools. While accessing such online resources does require a one-time network connection (after which packages like GEOquery can use locally cached data), when appropriate datasets exist in public repositories, this may be a perfectly viable alternative to experiment data packages. In this particular case, as of today in NCBI GEO, there are 1711 Human 450k samples with IDAT files available. I am not arguing that this route should replace experiment data packages, just that stable public data resources are an alternative worth considering.
Sean
Files are compressed in package tarballs, so your IDAT files may have a considerably smaller effective size. Generally, original text files are a much better way to store external data than Rda files. For instance, Rda files require updating when or if the class definition changes, whereas with the original files the provenance and content of the data are unambiguous. Experiment data packages are meant to provide reusable examples for pedagogic purposes. One would hope that minfiData fulfills this requirement. If not, then it would be better to continue the current discussion with Kasper and others in the community to identify an appropriately comprehensive data set for use across many relevant packages. There is no formal statement about the maximum size of experiment data packages, but one would need to make a strong argument for why a GB of experiment data is necessary (including why existing experiment data packages are fundamentally inadequate), especially if it is to support a single package.
Martin
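Martin's point about compression can be checked before deciding what to ship. The following is a minimal Python sketch, not part of the original thread and not a Bioconductor tool: it packs a directory into a temporary gzip-compressed tarball and reports the raw size next to the compressed size. The function name `tarball_sizes` is hypothetical, and the result only approximates what a built package tarball would weigh; the actual ratio depends entirely on the data.

```python
import os
import tarfile
import tempfile


def tarball_sizes(directory):
    """Return (raw_bytes, gzip_tarball_bytes) for all files under directory.

    Hypothetical helper: it only estimates how much a directory shrinks
    when packed into a .tar.gz archive.
    """
    # Sum the on-disk size of every regular file under the directory.
    raw = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            raw += os.path.getsize(os.path.join(root, name))

    # Write the archive to a temporary file rather than to memory,
    # since the input may be hundreds of megabytes.
    with tempfile.TemporaryFile() as tmp:
        with tarfile.open(fileobj=tmp, mode="w:gz") as tar:
            arcname = os.path.basename(os.path.normpath(directory))
            tar.add(directory, arcname=arcname)
        # Closing the tar object flushes the gzip stream; the end of the
        # temporary file is then the size of the compressed archive.
        packed = tmp.seek(0, os.SEEK_END)

    return raw, packed
```

Calling `tarball_sizes("inst/extdata")` from a package source directory (the path is only an example) returns the two sizes in bytes, which can then be compared against the size of the corresponding Rda file.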
--- Nicolas De Jay
On Thu, Nov 7, 2013 at 9:38 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
To give some background: it is true that the RGsetEx object (in data/RGsetEx.rda) is in 1-1 correspondence with the raw data files in inst/extdata, so one could consider it redundant. However, having the IDAT files is convenient for testing parsing, and also for other tools that want 450k example data without depending on minfi. Those are the two main reasons for including the raw data as well. There is also the fact that, while the data size is "big", it is only 6 samples.
Best, Kasper
On Thu, Nov 7, 2013 at 3:58 PM, Nicolas De Jay <nicolas.dejay at mail.mcgill.ca> wrote:
Hi, I am preparing a data package and using the minfiData package as a reference. The .idat files in extdata and the .rda file in data are present in both the compressed source tarball and the installed package directory (in my case, under ~/R/x86-64.../3.0/minfiData). Isn't this redundant? Is there a way to have the prospective user download only the .rda files? Sorry if my question is misguided and thanks in advance for your help.
--- Nicolas De Jay
M.Sc. Student
Department of Human Genetics
Montreal Children's Hospital Research Institute, McGill University Health Centre
4060 Ste Catherine West, PT-239
Montreal, QC H3Z2Z3, Canada
T: (514) 412-4440 | E: nicolas.dejay at mail.mcgill.ca
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793