[R-pkg-devel] Retrieving versioned csv datasets for use in an R package
Thanks so much Rafael, I think piggyback is exactly what I was looking for.
I wonder whether it is possible (or best practice) to include a call to it
during the install.packages('MyPackage') process, so that the data is
available before the tests run in the R CMD build GitHub Action and so that
users get the default/most recent dataset downloaded alongside the package.
-John
On Fri, Feb 14, 2025 at 4:08 PM Rafael H. M. Pereira <
rafa.pereira.br at gmail.com> wrote:
Hi John,
There are different alternatives on where to host the data (e.g. OSF, a
proprietary server, Github etc). The solution I've been adopting in most of
my packages is to use a combination of a proprietary server and Github.
So the data is first downloaded from our own server and only if our server
is offline, then the download is redirected to Github. This is what I try
to do so our packages do not overload Github. Of course, this creates some
additional work on our side to make sure the files on our server are
always mirrored on Github.
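
A minimal sketch of this fallback logic, assuming the mirrors are passed as a
character vector in order of preference. The function name
download_with_fallback and its arguments are made up for illustration; the
only real call is utils::download.file, which returns 0 on success.

```r
# Try each mirror in order and stop at the first successful download.
# `urls` is e.g. c(own_server_url, github_release_url); both are placeholders
# supplied by the caller. `downloader` can be swapped out for testing.
download_with_fallback <- function(urls, destfile,
                                   downloader = utils::download.file) {
  for (url in urls) {
    ok <- tryCatch({
      status <- downloader(url, destfile, mode = "wb", quiet = TRUE)
      identical(as.integer(status), 0L) && file.exists(destfile)
    },
    error = function(e) FALSE,
    warning = function(w) FALSE)
    if (ok) return(invisible(url))
    # Remove any partial file before trying the next mirror.
    if (file.exists(destfile)) unlink(destfile)
  }
  stop("Could not download the file from any of the mirrors.")
}
```

The function returns (invisibly) the URL that worked, which can be handy for
logging which mirror served the file.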
A key point to pay attention to when hosting the data on Github is to host
it as an attachment to a *release*. A good way to manage the files and
releases is using the {piggyback} package, by Carl Boettiger et al. at
rOpenSci. The documentation of the package is a really great guide on how
to host data on Github and it has some really convenient functions to
create releases, upload and download files. Kudos to them!
https://docs.ropensci.org/piggyback/
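
As a rough sketch of that workflow (not taken from any of our packages): the
repository name "user/MyPackage-data" and the two wrapper functions are
placeholders, while pb_release_create(), pb_upload() and pb_download() are
the {piggyback} functions. Uploading needs a GitHub token with write access
to the repository.

```r
# Maintainer side: create a release for one data version and attach files.
# "user/MyPackage-data" is a placeholder repository name.
publish_data <- function(files, tag, repo = "user/MyPackage-data") {
  piggyback::pb_release_create(repo = repo, tag = tag)
  piggyback::pb_upload(files, repo = repo, tag = tag)
}

# User side: download every file attached to release `tag` into `dest`.
fetch_release_data <- function(tag, dest, repo = "user/MyPackage-data") {
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  piggyback::pb_download(dest = dest, repo = repo, tag = tag)
  invisible(dest)
}
```

Using the data version as the release tag (e.g. "2.1.0") keeps each version
of the files separately addressable.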
Best,
Rafael Pereira
On Fri, Feb 14, 2025 at 11:55 AM John Clarke <
john.clarke at cornerstonenw.com> wrote:
Hi folks,
I've looked around for this particular question, but haven't found a good
answer. I have a versioned dataset that includes about 6 csv files that
total about 15MB for each version. The versions get updated every few years
or so and are used to drive the model, which was written in C++ but is now
inside an Rcpp wrapper. Apart from the fact that CRAN does not permit large
files, I want to have a better way for users to access particular versions
of the dataset.
Usage idea:
# The following would hopefully also download the default/most recent version
# of the csv files from CRAN (if allowed) or Github or some other repository
# for academic open source data.
install.packages("MyPackage")
mypackage = new(MyPackage)
Then, if necessary, the user could change the dataset used with something
like:
mypackage$dataset("2.1.0") which would retrieve new csv files if they
haven't already been downloaded and update the data_folder path internally
to point to the 2.1.0 directory.
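
A rough sketch of what I have in mind for the caching part, assuming one
folder per version under the per-user cache directory. The function names and
the `fetch` argument are hypothetical; tools::R_user_dir() needs R >= 4.0.0.

```r
# Folder where the csv files for one dataset version would be cached.
dataset_dir <- function(version,
                        root = tools::R_user_dir("MyPackage", which = "cache")) {
  file.path(root, version)
}

# Return the folder for `version`, calling `fetch(version, dir)` only when
# that folder does not exist yet. `fetch` is a hypothetical downloader.
use_dataset <- function(version, fetch,
                        root = tools::R_user_dir("MyPackage", which = "cache")) {
  dir <- dataset_dir(version, root)
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
    tryCatch(
      fetch(version, dir),
      error = function(e) {
        # Do not leave an empty folder behind if the download fails.
        unlink(dir, recursive = TRUE)
        stop(e)
      }
    )
  }
  dir
}
```

The returned path is what would be handed to the Rcpp side as data_folder.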
Requirements:
- The dataset is csv (not an R data object) and the Rcpp MyPackage expects
this format
- Would be nice to properly include citations for the data as they will
likely be initially released through a journal publication
What is the best practice for this sort of dataset management for a package
in R? Is it okay to use Github to store and version the data? Or is it
preferred to use an R package (ignoring the file size limit), or some other
open source data hosting? I see https://r-universe.dev/ as an option as
well. In any case, what is the proper mechanism for retrieving/caching the
data?
Thanks,
-John
John Clarke | Senior Technical Advisor |
Cornerstone Systems Northwest | john.clarke at cornerstonenw.com
______________________________________________ R-package-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-package-devel