[RFC] A case for freezing CRAN
Michael Weylandt <michael.weylandt at gmail.com> writes:
On Mar 19, 2014, at 22:17, Gavin Simpson <ucfagls at gmail.com> wrote:
Michael, I think the issue is that Jeroen wants to take that responsibility out of the hands of the person trying to reproduce a work. If it used R 3.0.x and packages A, B and C then it would be trivial to install that version of R and then pull down the stable versions of A, B and C for that version of R. At the moment, one might note the packages used and even their versions, but what about the versions of the packages that the used packages rely upon, and so on? What if developers don't state known working versions of dependencies?
Doesn't sessionInfo() give all of this? If you want to be very worried about every last bit, I suppose it should also include options(), compiler flags, compiler version, BLAS details, etc. (Good talk on the dregs of a floating point number and how hard it is to reproduce them across processors http://www.youtube.com/watch?v=GIlp4rubv8U)
In principle yes - but this calls specifically for a package which extracts the info and stores it in a human-readable format, which can then be used to re-install (automatically) all the versions for (hopefully) reproducibility - because if external libraries are involved, you HAVE problems.
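As a rough illustration of extracting that information, here is a minimal sketch that pulls package names and versions out of sessionInfo(). The helper name session_packages() is made up for this example; it is not an existing function. Base packages are left out because their versions follow the R version itself.

```r
# Hypothetical helper: name and version of every non-base package that is
# attached or loaded in the current session, taken from sessionInfo().
session_packages <- function(si = sessionInfo()) {
  # otherPkgs = attached packages, loadedOnly = loaded via a namespace only;
  # both are lists of package descriptions (or NULL when empty).
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  if (length(pkgs) == 0L) {
    return(data.frame(Package = character(0), Version = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(
    Package = vapply(pkgs, function(p) p$Package, character(1),
                     USE.NAMES = FALSE),
    Version = vapply(pkgs, function(p) p$Version, character(1),
                     USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

print(R.version.string)     # the R release in use
print(session_packages())   # one row per attached or loaded package
```

This only covers what R itself knows about; compiler flags, BLAS details and external system libraries would have to be recorded separately.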
The problem is how the heck do you know which versions of packages are needed if developers don't record these dependencies in sufficient detail? The suggested solution is to freeze CRAN at intervals alongside R releases. Then you'd know what the stable versions were.
Only if you knew which R release was used.
Well - that would be easier to specify in a paper than the version info of all packages needed - and which of the installed ones are actually needed? OK - the ones specified in library() calls. But wait - there are dependencies, imports, ... That is a lot of digging - I would not know how to do this off the top of my head, except by digging through the DESCRIPTION files of the packages...
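The digging described here can be partly automated with tools::package_dependencies(), which resolves the Depends/Imports/LinkingTo fields recursively. The sketch below is only one possible approach: the helper name dependency_versions() is hypothetical, and it assumes that the locally installed library (rather than CRAN) is the right source of dependency information.

```r
# Hypothetical helper: every package that `pkgs` need at run time
# (recursively), with the versions found in the local library.
dependency_versions <- function(pkgs, lib.loc = NULL) {
  db <- installed.packages(lib.loc = lib.loc)
  deps <- tools::package_dependencies(
    pkgs, db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  needed <- sort(unique(c(pkgs, unlist(deps, use.names = FALSE))))
  needed <- needed[needed %in% rownames(db)]   # drop anything not installed
  data.frame(Package = needed,
             Version = unname(db[needed, "Version"]),
             stringsAsFactors = FALSE)
}

# Start from the packages named in library() calls, e.g. "stats".
print(dependency_versions("stats"))
```

Suggests and Enhances are deliberately ignored here; whether they matter for reproducing a given analysis is a judgment call.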
Or we could just get package developers to be more thorough in documenting dependencies. Or R CMD check could refuse to pass if a package is listed as a dependency but with no version qualifiers. Or have R CMD build add an upper bound (from the current, at build-time version of dependencies on CRAN) if the package developer didn't include an upper bound. Or... The first is unlikely to happen consistently, and no-one wants *more* checks and hoops to jump through :-) To my mind it is incumbent upon those wanting reproducibility to build the tools to enable users to reproduce works.
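To make the "no version qualifiers" idea concrete, here is a small sketch of how such a test could look. This is not something R CMD check does; the function unversioned_deps() is invented for illustration and simply treats any entry without a parenthesised qualifier as unversioned.

```r
# Hypothetical check: return the entries of a Depends/Imports field
# that carry no version qualifier such as "(>= 1.0)".
unversioned_deps <- function(field) {
  if (is.na(field) || !nzchar(field)) return(character(0))
  entries <- trimws(strsplit(field, ",", fixed = TRUE)[[1]])
  entries <- entries[nzchar(entries)]
  entries[!grepl("(", entries, fixed = TRUE)]
}

unversioned_deps("R (>= 3.0.0), methods, Matrix (>= 1.1-0), lattice")
# returns c("methods", "lattice")
```

A real check would also have to decide how to treat base packages such as methods, whose version is fixed by the R version anyway.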
But the tools already allow it with minimal effort. If the author can't even include session info, how can we be sure the version of R is known? If we can't know which version of R, can we ever change R at all? Etc. to absurdity. My (serious) point is that the tools are in place, but ramming them down folks' throats by intentionally keeping them on older versions by default is too much.
When you write a paper or release a tool, you will have tested it with a specific set of packages. It is relatively easy to work out what those versions are (there are tools in R for this). What is required is an automated way to record that info in an agreed upon way in an approved file/location, and have a tool that facilitates setting up a package library sufficient with which to reproduce a work. That approval doesn't need to come from CRAN or R Core - we can store anything in ./inst.
I think the package version and published paper cases are different. For the latter, the recipe is simple: if you want the same results, use the same software (as noted by sessionInfoPlus() or equiv).
Dependencies, imports, package versions, ... not that straightforward, I would say.
For the former, I think you start straying into this NP-complete problem: http://people.debian.org/~dburrows/model.pdf Yes, a good config can (and should) be recorded, but isn't that exactly what sessionInfo() gives?
Reproducibility is a very important part of doing "science", but not everyone using CRAN is doing that. Why force everyone to march to the reproducibility drum? I would place the onus elsewhere to make this work.
Agreed: reproducibility is the onus of the author, not the reader.
Exactly - but also the authors of the software which is aimed at being used in the context of reproducibility - the tools should be there to make it easy! My points are:
1) I think the snapshot idea of CRAN is a good idea which should be followed.
2) The snapshots should be incorporated at CRAN, as I assume that CRAN will be there longer than any third party repository.
3) The default for the user should *not* change, i.e. normal users will always get the newest packages as it is now.
4) If this can / will not be done because of workload, storage space, ... commands should be incorporated in a package (preferably one which becomes part of the core packages) to store snapshots of installed package and R version information as a human-readable text file, which can be parsed by a second command to re-create this setup.
Cheers, and thanks for this important discussion (could have been a GSoC project?),
Rainer
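As a sketch of what the pair of commands in point 4 might look like: write_snapshot() and compare_snapshot() are hypothetical names, and DCF (the format of DESCRIPTION files) is just one assumed choice of human-readable format. The second function only reports differences; actually re-installing old versions needs network access and is left out.

```r
# Hypothetical: store the R version and the versions of `pkgs` as DCF text.
write_snapshot <- function(file, pkgs) {
  db <- installed.packages()
  pkgs <- intersect(pkgs, rownames(db))        # keep installed packages only
  snap <- data.frame(
    Package = c("R", pkgs),                    # R itself recorded as a pseudo-package
    Version = c(paste(R.version$major, R.version$minor, sep = "."),
                unname(db[pkgs, "Version"])),
    stringsAsFactors = FALSE
  )
  write.dcf(snap, file = file)
  invisible(snap)
}

# Hypothetical: read a snapshot back and compare it with the current setup.
compare_snapshot <- function(file) {
  snap <- as.data.frame(read.dcf(file), stringsAsFactors = FALSE)
  db <- installed.packages()
  installed <- db[match(snap$Package, rownames(db)), "Version"]  # NA if missing
  r_version <- paste(R.version$major, R.version$minor, sep = ".")
  snap$Installed <- unname(ifelse(snap$Package == "R", r_version, installed))
  snap$Matches <- !is.na(snap$Installed) & snap$Installed == snap$Version
  snap
}

f <- tempfile(fileext = ".dcf")
write_snapshot(f, c("stats", "utils"))
cat(readLines(f), sep = "\n")   # the human-readable snapshot
print(compare_snapshot(f))      # which entries still match this installation
```

A file like this could live in ./inst of a package or alongside a paper's scripts, independent of whether CRAN itself keeps snapshots.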
Gavin A scientist, very much interested in reproducibility of my work and others.
Michael In finance, where we call it "Auditability" and care very much as well :-)
Rainer M. Krug email: Rainer<at>krugs<dot>de PGP: 0x0F52F982