
[R-pkg-devel] Best practices for distributing large data files

3 messages · Ayala Hernandez, Rafael, Neal Fultz, Greg Minshall

#
Dear all,

I am currently trying to think of the best way to distribute large sets of coefficients required by my package asteRisk.

At the moment, I am using an accessory data package, asteRiskData, available from a drat repository, that bundles all of the required coefficients already parsed and stored as R objects.

However, as my package grows, the amount of data it requires is also growing. As a result, asteRiskData has reached 99.99 MB, which is right at the limit of what can be uploaded to GitHub. Since the source package must be uploaded as a single .tar.gz file for the drat repository, I see no easy workaround other than splitting it into multiple accessory data packages.

I believe this option could become rather troublesome in the future, if the number of accessory data packages starts to grow too much.

So I would like to ask, is there any recommended procedure for distributing such large data files? 

Another option that has been suggested to me is not to use an accessory data package at all. Instead, the package would download and parse the required data on demand from the corresponding internet resources and store the results locally. Future sessions would then load the local copies, so that download and parsing happen only once (or only once in a while, if the associated resource is updated) rather than in every R session. However, this would leave relatively large files (several tens of MB) scattered in the local environment of users, instead of having them all centralized in the accessory data package. Is this option acceptable as well?
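
To make this second option concrete, here is a minimal sketch of what I have in mind. It assumes R >= 4.0.0 so that tools::R_user_dir() is available. The function name get_cached(), the dataset name, and the placeholder URL are hypothetical and only for illustration; nothing like this exists in asteRisk yet.

```r
# Hypothetical helper: return cached data, fetching it only on first use.
# `fetch` is any zero-argument function that downloads and parses the data.
# The default cache location is an assumption (per-user cache dir, R >= 4.0.0).
get_cached <- function(name, fetch,
                       cache_dir = tools::R_user_dir("asteRisk", which = "cache")) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # later sessions: load the local copy
  }
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  value <- fetch()         # first use: download and parse
  saveRDS(value, path)     # keep the parsed object for future sessions
  value
}

# Example use (placeholder URL, not a real resource):
# coefs <- get_cached("coefficients", function() {
#   tmp <- tempfile()
#   download.file("https://example.org/coefficients.txt", tmp)
#   read.table(tmp)
# })
```

Keeping everything under a single per-user cache directory would at least avoid scattering the files across the user's system.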

Thanks a lot in advance for any insights

Best wishes,

Rafa
#
I host my clients' packages on AWS; the cost is minimal, and installing
from there onto other systems on Amazon is extremely fast. Here's the script I use:

https://github.com/njnmco/njnmverse/blob/master/Makefile



On Tue, Feb 15, 2022 at 6:55 PM Ayala Hernandez, Rafael <
r.ayala14 at imperial.ac.uk> wrote:

            

  
  
#
Rafael,
i've done this (download on demand, then cache locally) for some World
Bank data.  for my simple use, this works well.  but, i don't really
have any systematic way of "invalidating" the cache, providing the user
with any control over this, etc., and i consider that a problem.

cheers, Greg
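
for what it's worth, one simple way to get some control over such a cache might be an age check plus an explicit reset, roughly as sketched below. the function names cache_is_stale() and clear_cache() and the 30-day default are hypothetical, chosen only for illustration; this is not what my code currently does.

```r
# Hypothetical helper: TRUE if the cached file is missing or older than
# `max_age_days`, meaning it should be downloaded again.
cache_is_stale <- function(path, max_age_days = 30) {
  if (!file.exists(path)) return(TRUE)
  age <- difftime(Sys.time(), file.mtime(path), units = "days")
  as.numeric(age) > max_age_days
}

# Hypothetical helper: remove the whole cache directory so the user can
# force a fresh download. Invisibly returns TRUE if the directory is gone.
clear_cache <- function(cache_dir) {
  unlink(cache_dir, recursive = TRUE)
  invisible(!dir.exists(cache_dir))
}
```

an age limit is crude (it re-downloads even when the upstream file is unchanged), but it is easy to explain to users and needs no support from the server.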