[R-pkg-devel] Large Data Package CRAN Preferences
Hi Uwe,
Thanks for this information; it makes sense to me. Is there a preferred way to cache the data locally?
None of the ways I can think of to cache the data sounds particularly good, and I wonder if I'm missing something. The ideas that occur to me are:
1. Download them into the package directory `path.package("datapkg")`, but that would require an action to run at install time, and I'm unaware of any supported way to trigger one.
2. Have a user-specified cache directory (e.g. `options("datapkg_cache"="/my/cache/location")`), but that would require interaction with every use. (Not horrible, but it will likely significantly increase the number of user issues with the package.)
3. Have a user-specified cache directory like #2, but have it default to somewhere in their home directory like `file.path(Sys.getenv("HOME"), "datapkg_cache")` if they have not set the option.
To me #3 sounds best, but I'd like to be sure that I'm not missing something.
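For concreteness, option #3 could be sketched roughly as below. The option name `datapkg_cache`, the directory name, and `combine_all_updates()` are placeholders from the list above, not established names; and in R >= 4.0.0, `tools::R_user_dir("datapkg", which = "cache")` would be the idiomatic default location rather than a bare directory under `HOME`.

```r
# Sketch of option #3: resolve the cache directory, preferring a
# user-set option and falling back to a default under HOME.
datapkg_cache_dir <- function() {
  dir <- getOption("datapkg_cache")
  if (is.null(dir)) {
    dir <- file.path(Sys.getenv("HOME"), "datapkg_cache")
  }
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
  }
  dir
}

# Example use: cache the combined dataset and reuse it on later calls.
load_combined <- function() {
  cache_file <- file.path(datapkg_cache_dir(), "combined.rds")
  if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    combined <- combine_all_updates()  # hypothetical slow (~1 min) merge
    saveRDS(combined, cache_file)
    combined
  }
}
```

This keeps all writes out of the library location: the package only ever writes into the user-controlled cache directory.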
Thanks,
Bill
-----Original Message-----
From: Uwe Ligges <ligges at statistik.tu-dortmund.de>
Sent: Sunday, December 15, 2019 11:54 AM
To: bill at denney.ws; r-package-devel at r-project.org
Subject: Re: [R-pkg-devel] Large Data Package CRAN Preferences
Ideally you would host the data elsewhere and submit a CRAN package that allows users to easily get/merge/aggregate the data.
Best,
Uwe Ligges
On 12.12.2019 20:55, bill at denney.ws wrote:
Hello, I have two questions about creating data packages for data that will be updated and in total are >5 MB in size. The first question is: In the CRAN policy, it indicates that packages should be ?5 MB in size in general. Within a package that I'm working on, I need access to data that are updated approximately quarterly, including the historical datasets (specifically, these are the SDTM and CDASH terminologies in https://evs.nci.nih.gov/ftp1/CDISC/SDTM/Archive/). Current individual data updates are approximately 1 MB when individually saved as .RDS, and the total current set is about 20 MB. I think that the preferred way to generate these packages since there will be future updates is to generate one data package for each update and then have an umbrella package that will depend on each of the individual data update packages. That seems like it will minimize space requirements on CRAN since old data will probably never need to be updated (though I will need to access it). Is that an accurate summary of the best practice for creating these as a data package? And a second question is: Assuming the best practice is the one I described above, the typical need will be to combine the individual historical datasets for local use. An initial test of the time to combine the data indicates that it would take about 1 minute to do, but after combination, the result could be loaded faster. I'd like to store the combined dataset locally with the umbrella package. I believe that it is considered poor form to write within the library location for a package except during installation. What is the best practice for caching the resulting large dataset which is locally-generated? Thanks, Bill [[alternative HTML version deleted]]
______________________________________________
R-package-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel