Version Controlled CRAN Packages

3 messages · Mario Bourgoin, Anthony Damico, Marc Schwartz

On Jan 3, 2013, at 8:33 AM, Mario Bourgoin <mob at media.mit.edu> wrote:
I suspect that you will get various responses, so let me offer my two cents:

1. The old versions of CRAN packages are typically, but possibly not always, available via an "Old sources" link on each package's page on CRAN. You could use that approach to obtain old source versions of packages. However, it is conceivable that locally compiling and using the archived source version of a package (e.g. where you may previously have used a precompiled binary on OSX, Windows or even Linux in some cases) could yield behavioral changes over time. Hardware, OS, compiler and other environmental changes (bugs, 32 versus 64 bit, differing compiler options, etc.) could introduce subtle differences that may prevent you from exactly replicating results from previous work. These are especially important to consider for CRAN packages that are not "pure R" (e.g. those that include C, C++, FORTRAN, etc.).

2. The old versions of contributed CRAN packages that are physically on CRAN are not under a true file level source version control system there. It is up to each package maintainer/author to elect to use such a tool themselves outside of CRAN. R-Forge and GitHub are perhaps the two most popular online platforms, but some maintainers use other platforms, and others use local offline repos that you do not have access to. Some may not use a true version control system at all. There is no requirement for or any enforcement of a particular development process for contributed CRAN packages.

3. While R itself is under SVN control, that is not likely to be helpful to you if you typically install precompiled binary versions of R rather than compiling R from source and keeping track of SVN revision numbers. You will want to archive the OS-specific R binaries that you use.

4. As noted above, it is conceivable that running code today versus running that same code five years from now, using the same versions of R and CRAN packages that you used today, can be problematic. It is not only R and the CRAN packages that are changing, but your hardware, OS, compilers and possibly other relevant tools are highly likely to change as well. All of these factors can contribute to your ability or inability to exactly replicate results over time. Only you can determine just how much of today's R/CRAN installation and computing environment you need to be able to replicate in the future.

5. If you need to replicate the same results five years from now on the same datasets that you use today, you will need to maintain your datasets (not just your code) in a version control system as well.

6. You might also want to look into "Reproducible Research".
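
As a rough illustration of point 1, the sketch below builds the URL of an archived source tarball and downloads it so that it can be kept locally. It assumes the directory layout used on the main CRAN site, where superseded versions sit under src/contrib/Archive/ (the current release of a package is normally under src/contrib/ instead). The function names and the package name `examplepkg` are hypothetical, chosen only for illustration.

```python
import os
from urllib.request import urlretrieve

# Assumption: the main CRAN site; a mirror with the same layout should also work.
CRAN_MIRROR = "https://cran.r-project.org"


def cran_archive_url(package: str, version: str, mirror: str = CRAN_MIRROR) -> str:
    """Build the URL of an archived (superseded) source tarball."""
    base = mirror.rstrip("/")
    return f"{base}/src/contrib/Archive/{package}/{package}_{version}.tar.gz"


def fetch_archived_source(package: str, version: str, dest_dir: str = ".") -> str:
    """Download the archived tarball into dest_dir and return its local path."""
    url = cran_archive_url(package, version)
    dest = os.path.join(dest_dir, f"{package}_{version}.tar.gz")
    urlretrieve(url, dest)  # raises an error if the version is not in the archive
    return dest


# Hypothetical usage (requires network access):
#   fetch_archived_source("examplepkg", "1.0-0", dest_dir="r-archive")
```

A tarball obtained this way can then be installed with `R CMD INSTALL`, subject to the compiler and OS caveats described in point 1.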


Bottom line, you have defined or are in the process of defining your own local requirements and perhaps SOPs. Thus, take control of your own risk mitigation process. Set up your own version control system locally and include in it, if you use them, the precompiled binaries of R and any CRAN packages that you may use, so that you can replicate the state of an R installation to your own requirements, notwithstanding the hardware and OS level changes that will occur.

You will of course want to document the version of R and any third party packages that you use when performing an analysis, so that you can track such information for future use. 
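
One possible way to record that information is sketched below: a small wrapper that asks R for its version string and the versions of all installed packages, then saves the output to a timestamped text file. It assumes `Rscript` is on the PATH. The helper names and the output file name are hypothetical. Note that this reports what is installed, not which packages a given analysis actually loaded; calling `sessionInfo()` at the end of the analysis script itself would capture the latter.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List

# R expression: print the R version, then a CSV of installed packages and versions.
R_EXPR = (
    "writeLines(R.version.string); "
    'write.csv(installed.packages()[, c("Package", "Version")], row.names = FALSE)'
)


def version_report_command(rscript: str = "Rscript") -> List[str]:
    """Command line that makes R print its version and installed packages."""
    return [rscript, "-e", R_EXPR]


def capture_version_report(rscript: str = "Rscript") -> str:
    """Run R and return the report text (raises if Rscript fails or is missing)."""
    result = subprocess.run(
        version_report_command(rscript), capture_output=True, text=True, check=True
    )
    return result.stdout


def save_version_report(text: str, out_dir: str = ".") -> Path:
    """Write the report to a UTC-timestamped file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(out_dir) / f"r-versions-{stamp}.txt"
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical usage (requires R to be installed):
#   save_version_report(capture_version_report(), out_dir="analysis-2013-01")
```

Committing the resulting file alongside the analysis code keeps the version record tied to the results it describes.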

If you compile and install source versions of R and CRAN packages, then I would keep source level tarballs of each in said version control system so that you can reasonably ensure access to them when you need it, even though they may also be available via CRAN.

I would be sure that such a repo (or more likely, content/project specific repos) is stored on a central server, which is backed up offline with sufficient frequency and redundancy to mitigate the risk of loss.
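
To check later that archived tarballs, binaries and datasets are still byte-for-byte what was originally stored (for example after restoring from a backup), a checksum manifest may help. The following is a minimal sketch using SHA-256. The directory layout and function names are assumptions for illustration, not part of any R or CRAN tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: str) -> Dict[str, str]:
    """Map each file's path (relative to archive_dir) to its SHA-256 digest."""
    root = Path(archive_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(archive_dir: str, recorded: Dict[str, str]) -> List[str]:
    """List files that were added, removed or modified since `recorded` was built."""
    current = build_manifest(archive_dir)
    names = set(current) | set(recorded)
    return sorted(n for n in names if current.get(n) != recorded.get(n))
```

The manifest itself is best stored somewhere other than the directory it describes, such as in the repository history or with the backup records, so that it is not hashed along with the files it covers.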

The two most popular VC tools these days are SVN and Git. There are significant differences between their implementation models, so you will need to take time to consider your own functional and operational requirements, which may lead you in one direction or the other. That being said, I made the switch from SVN to Git last year, even though I don't need true distributed version control myself. There are various reasons for that switch, which are beyond the scope of this discussion, so I won't get into details here.

I hope that the above is helpful.

Regards,

Marc Schwartz