enabling reproducible research & R package management & install.package.version & BiocLite
Just my 2 cents: it may not be a good idea to restrict software versions to gain reproducibility. To me, this kind of reproducibility is "dead" reproducibility (what if the old software has a fatal bug? do we want to reproduce the same **wrong** results?). Software packages are continuously evolving, and our research should adapt as well.

How to achieve this? I think this paper by Robert Gentleman and Duncan Temple Lang gives a nice answer: http://biostats.bepress.com/bioconductor/paper2/

With R 3.0.0 coming, it will be easy to achieve what they have outlined because R 3.0 allows custom vignette builders. Basically, your research paper can be built with 'R CMD build' and checked with 'R CMD check' if you provide an appropriate builder.

An R package has great potential to become the ideal tool for reproducible research thanks to its wonderful infrastructure: functions, datasets, examples, unit tests, vignettes, dependency structure, and so on. With the help of version control, you can easily spot the changes in your results after you upgrade the packages. With an R package, you can automate a lot of things, e.g. install.packages() will take care of dependencies and R CMD build can rebuild your paper.

Just like Bioc has a devel version, you can continuously check your results against a devel version, so that you know what is going to break if you upgrade to new versions of other packages.

Is developing a research paper so different from developing a software package (in the context of computing)? Probably not.

Long live the reproducible research!

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA

On Mon, Mar 4, 2013 at 3:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
Hi,

In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted. By which I mean the exact version of R and the exact version of all packages used in a particular R session.

I am seeking comments/criticism of this as a goal, and of the following outline of an approach:

=== When all the steps of a workflow have been finalized ===

* re-run the workflow from beginning to end
* save the results of sessionInfo() into an RDS file named after the current date and time.

=== Later, when desirous of exactly recreating this analysis ===

* read the (old) sessionInfo() into an R session
* exit with failure if the running version of R doesn't match
* compare the old sessionInfo to the currently installed packages (e.g. using packageVersion)
* where there are discrepancies, install the required version of the package (without dependencies) into a new library (named after the old sessionInfo RDS file)

Then the analyst should be able to put the new library at the front of .libPaths and run the analysis, confident that the same versions of the packages are being used.

I have in the past used install-package-version.R to revert to previous versions of R packages successfully (https://gist.github.com/1503736). And there is a similar tool in Hadley Wickham's devtools.

But I don't know if I need something special for (BioConductor) packages that have been installed using biocLite, and I seek advice here.

I do understand that the R environment alone is not sufficient to guarantee reproducibility. Some of my colleagues have suggested saving a virtual machine with all your software/library/data installed. So I am also interested, in general, in what other people are doing to this end. But I am most interested in:

* is this a good idea
* is there a worked out solution
* does biocLite introduce special cases
* where do the dragons lurk

... and the like

Any tips?

Thanks,

~ Malcolm Cook
Stowers Institute / Computation Biology / Shilatifard Lab
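
The snapshot-and-compare steps outlined in the original message could be sketched roughly as follows. This is only a minimal illustration, not a worked-out solution from the thread: the helper names (session_versions, save_session, check_session) and the file naming pattern are hypothetical, and the sketch stops at reporting discrepancies rather than installing older package versions. It also does nothing special for packages installed with biocLite.

```r
# Hypothetical helpers; the names are illustrative and do not come from the thread.

# Reduce a sessionInfo() object to a named character vector: package -> version.
# Versions are normalised through package_version() so that "1.0-1" and "1.0.1"
# compare as equal.
session_versions <- function(si) {
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  vapply(pkgs, function(d) as.character(package_version(d$Version)), character(1))
}

# Save sessionInfo() into an RDS file named after the current date and time.
save_session <- function(dir = ".") {
  path <- file.path(dir, format(Sys.time(), "sessionInfo-%Y%m%d-%H%M%S.rds"))
  saveRDS(sessionInfo(), path)
  path
}

# Read an old snapshot, fail if the running R differs, and return a data frame
# of packages whose installed version is missing or differs from the snapshot.
check_session <- function(path) {
  old <- readRDS(path)
  if (!identical(old$R.version$version.string, R.version$version.string)) {
    stop("R version mismatch: snapshot was taken under ",
         old$R.version$version.string)
  }
  wanted <- session_versions(old)
  installed <- vapply(names(wanted), function(p) {
    tryCatch(as.character(packageVersion(p)), error = function(e) NA_character_)
  }, character(1))
  mismatch <- is.na(installed) | installed != wanted
  data.frame(package   = as.character(names(wanted)),
             wanted    = unname(wanted),
             installed = unname(installed),
             stringsAsFactors = FALSE)[mismatch, ]
}
```

Calling save_session() at the end of a finalized workflow and check_session() on the resulting file later gives the list of packages that would need to be installed into a separate library before that library is put at the front of .libPaths(). Only attached and loaded non-base packages are compared here; whether that is sufficient is exactly the kind of question raised above.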