
enabling reproducible research & R package management & install.package.version & BiocLite

19 messages · Aaron Mackey, Dan Tenenbaum, Yihui Xie, and 10 others

#
Hi,

In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted.

By which I mean, the exact version of R and the exact version of all packages used in a particular R session.

I am seeking comments/criticism of this as a goal, and of the following outline of an approach:

=== When all the steps of a workflow have been finalized ===
* re-run the workflow from beginning to end
* save the results of sessionInfo() into an RDS file named after the current date and time.

=== Later, when desirous of exactly recreating this analysis ===
* read the (old) sessionInfo() into an R session
* exit with failure if the running version of R doesn't match
* compare the old sessionInfo to the currently available installed libraries (i.e. using packageVersion)
* where there are discrepancies, install the required version of the package (without dependencies) into new library (named after the old sessionInfo RDS file)

Then the analyst should be able to put the new library at the front of .libPaths() and run the analysis, confident that the same versions of the packages are in use.
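A minimal sketch of the comparison step, assuming sessionInfo() was saved with saveRDS() as above (the RDS filename is illustrative, and note that sessionInfo()'s otherPkgs slot only records *attached* packages, not everything loaded):

```r
## Compare a saved sessionInfo() against the running environment.
## Untested sketch; filename and reporting are illustrative.
old <- readRDS("sessionInfo-2013-03-04.rds")

## Exit with failure if the running version of R doesn't match.
if (!identical(old$R.version$version.string, R.version.string))
  stop("R version mismatch; analysis used ", old$R.version$version.string)

## Report packages whose installed version differs from the recorded one.
recorded <- vapply(old$otherPkgs, `[[`, character(1), "Version")
for (pkg in names(recorded)) {
  installed <- as.character(packageVersion(pkg))
  if (installed != recorded[[pkg]])
    message(pkg, ": installed ", installed, ", recorded ", recorded[[pkg]])
}
```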

I have in the past used install-package-version.R to revert to previous versions of R packages successfully (https://gist.github.com/1503736).  And there is a similar tool in Hadley Wickham's devtools.

But I don't know whether I need something special for (Bioconductor) packages that have been installed using biocLite, and I seek advice here.

I do understand that the R environment alone is not sufficient to guarantee reproducibility.  Some of my colleagues have suggested saving a virtual machine with all of one's software/libraries/data installed.  So I am also interested, in general, in what other people are doing to this end.  But I am most interested in:

* is this a good idea
* is there a worked out solution
* does biocLite introduce special cases
* where do the dragons lurk

... and the like

Any tips?

Thanks,

~ Malcolm Cook
Stowers Institute / Computational Biology / Shilatifard Lab
#
On Mon, Mar 4, 2013 at 4:28 PM, Aaron Mackey <amackey at virginia.edu> wrote:
Sounds like the best bet -- maybe tools like vagrant might be useful here:

http://www.vagrantup.com

... or maybe they're overkill?

I haven't really checked them out much myself, but my impression is that
these tools (Vagrant, Chef, Puppet) are built to handle such cases.

I'd imagine you'd probably need a location where you can grab the
precise (versioned) packages for the things you are specifying, but
...

-steve
#
On Mon, Mar 4, 2013 at 2:15 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
Right...and this is a bit tricky, because we don't keep old versions
around in our BioC software repositories.  They are available through
Subversion but with the sometimes additional overhead of setting up
build-time dependencies.

Dan
#
Just my 2 cents: it may not be a good idea to restrict software
versions to gain reproducibility. To me, this kind of reproducibility
is "dead" reproducibility (what if the old software has a fatal bug?
do we want to reproduce the same **wrong** results?). Software
packages are continuously evolving, and our research should be adapted
as well. How to achieve this? I think this paper by Robert Gentleman
and Duncan Temple Lang has given a nice answer:
http://biostats.bepress.com/bioconductor/paper2/

With R 3.0.0 coming, it will be easy to achieve what they have
outlined because R 3.0 allows custom vignette builders. Basically,
your research paper can be built with 'R CMD build' and checked with
'R CMD check' if you provide an appropriate builder. An R package has
the great potential of becoming the ideal tool for reproducible
research due to its wonderful infrastructure: functions, datasets,
examples, unit tests, vignettes, dependency structure, and so on. With
the help of version control, you can easily spot the changes after you
upgrade the packages. With an R package, you can automate a lot of
things, e.g. install.packages() will take care of dependencies and R
CMD build can rebuild your paper.

Just like Bioc has a devel version, you can continuously check your
results in a devel version, so that you know what is going to break if
you upgrade to new versions of other packages. Is developing a
research paper so different from developing a software package (in
the context of computing)? Probably not.
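For concreteness, a paper-as-package might carry a DESCRIPTION along these lines (all names illustrative; the VignetteBuilder field is the hook R 3.0.0's custom vignette builder support uses):

```
Package: mypaper
Title: Code and Text for Our Analysis
Version: 0.1.0
Depends: R (>= 3.0.0)
Suggests: knitr
VignetteBuilder: knitr
```

after which `R CMD build mypaper` rebuilds the paper and `R CMD check` exercises its examples and tests.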

Long live reproducible research!

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA
On Mon, Mar 4, 2013 at 3:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
#
On Mon, Mar 04, 2013 at 05:04:25PM -0600, Yihui Xie wrote:
[...]
[...]

A new major release number might be the right point to switch
from svn to git.

Branch-and-merge made easy :-)


Ciao,
   Oliver
#
I hate to ask what got this thread started, but it sounds like someone was counting on
exact numeric reproducibility -- or was there a bug in a specific release? In actual
fact, the best way to determine reproducibility is to run the code in a variety of
packages. Alternatively, you can do everything in Java and not assume
that calculations commute or associate as the code is modified, but that seems
pointless. Sensitivity determination would seem to lead to more reproducible results
than trying to keep a specific set of code quirks.

I also seem to recall that the FPU may have random lower-order bits in some cases, so the
same code/data give different results. Always assume FP is stochastic and plan
on analyzing the "noise."


----------------------------------------
#
.>>> * where do the dragons lurk
 .>>>
 .>>
 .>> webs of interconnected dynamically loaded libraries, identical versions of
 .>> R compiled with different BLAS/LAPACK options, etc.  Go with the VM if you
 .>> really, truly, want this level of exact reproducibility.
 .>
 .> Sounds like the best bet -- maybe tools like vagrant might be useful here:
 .>
 .> http://www.vagrantup.com
 .>
 .> ... or maybe they're overkill?
 .>
 .> Haven't really checked it out myself too much, my impression is that
 .> these tools (vagrant, chef, puppet) are built to handle such cases.
 .>
 .> I'd imagine you'd probably need a location where you can grab the
 .> precise (versioned) packages for the things you are specifying, but
 .
 .Right...and this is a bit tricky, because we don't keep old versions
 .around in our BioC software repositories.  They are available through
 .Subversion but with the sometimes additional overhead of setting up
 .build-time dependencies.


So, even if I wanted to go where dragons lurked, it would not be possible to cobble together a version of biocLite that installed specific versions of software.

Thus, I might rather consider an approach that at 'publish' time tars up a copy of the R package dependencies, based on a config file derived from sessionInfo(), and caches it in the project directory.

Then, when/if the project is revisited (and found to produce different results under the current R enviRonment), I can "simply" install an old R (oops, I guess I'd have to build it), and then un-tar the dependencies into the project's own R/Library, which I would put on .libPaths().
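A rough sketch of what that 'publish-time' step might look like (untested; the archive name is illustrative, and an archive of compiled packages will of course only restore cleanly onto the same platform and R version):

```r
## Archive the installed directories of all attached packages into
## the project, keyed off sessionInfo().
si   <- sessionInfo()
pkgs <- names(si$otherPkgs)
dirs <- vapply(pkgs, find.package, character(1))
tar("project-R-library.tar.gz", files = dirs, compression = "gzip")
```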

Or, or?  

(My virtual machine advocating colleagues are snickering now, I am sure......)

Thanks for all your thoughts and advice....

--Malcolm

 .
 .
 .> ...
 .>
 .> -steve
 .>
 .> --
 .> Steve Lianoglou
 .> Graduate Student: Computational Systems Biology
 .>  | Memorial Sloan-Kettering Cancer Center
 .>  | Weill Medical College of Cornell University
 .> Contact Info: http://cbio.mskcc.org/~lianos/contact
#
All,

What got me started on this line of inquiry was my attempt at balancing the advantages of performing a periodic (daily or weekly) update to the 'release' version of locally installed R/Bioconductor packages on our institute-wide installation of R with the disadvantages of potentially changing the result of an analyst's workflow in mid-project.

I just got the "green light" to institute such periodic updates, which I have been arguing are in our collective best interest.  In return, I promised my best effort to provide a means of preserving, or reverting to, a working R library configuration.

Please note that the reproducibility I am most eager to provide is limited to reproducibility within the computing environment of our institute, which perhaps takes away some of the dragons' nests, though certainly not all.

There are technical issues with updating package installations on an NFS mount that might have files/libraries open on it from running R sessions.  I am interested in learning of approaches for minimizing/eliminating exposure to these issues as well.  The first/best approach seems to be to institute a 'blackout' period during which users should expect the installed library to change.  Perhaps there are improvements on this?

Best,

Malcolm


 .-----Original Message-----
 .From: Mike Marchywka [mailto:marchywka at hotmail.com]
 .Sent: Tuesday, March 05, 2013 5:24 AM
 .To: amackey at virginia.edu; Cook, Malcolm
 .Cc: r-devel at r-project.org; bioconductor at r-project.org; r-discussion at listserv.stowers.org
 .Subject: RE: [Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
 .
 .
 .I hate to ask what go this thread started but it sounds like someone was counting on
 .exact numeric reproducibility or was there a bug in a specific release? In actual
 .fact, the best way to determine reproducibility is run the code in a variety of
 .packages. Alternatively, you can do everything in java and not assume
 .that calculations commute or associate as the code is modified but it seems
 .pointless. Sensitivity determination would seem to lead to more reprodicible results
 .than trying to keep a specific set of code quirks.
 .
 .I also seem to recall that FPU may have random lower order bits in some cases,
 .same code/data give different results. Alsways assume FP is stochastic and plan
 .on anlayzing the "noise."
 .
 .
 .----------------------------------------
 .> From: amackey at virginia.edu
 .> Date: Mon, 4 Mar 2013 16:28:48 -0500
 .> To: MEC at stowers.org
 .> CC: r-devel at r-project.org; bioconductor at r-project.org; r-discussion at listserv.stowers.org
 .> Subject: Re: [Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
 .>
.> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
.>
 .> > * where do the dragons lurk
 .> >
 .>
 .> webs of interconnected dynamically loaded libraries, identical versions of
 .> R compiled with different BLAS/LAPACK options, etc. Go with the VM if you
 .> really, truly, want this level of exact reproducibility.
 .>
 .> An alternative (and arguably more useful) strategy would be to cache
 .> results of each computational step, and report when results differ upon
 .> re-execution with identical inputs; if you cache sessionInfo along with
 .> each result, you can identify which package(s) changed, and begin to hunt
 .> down why the change occurred (possibly for the better); couple this with
 .> the concept of keeping both code *and* results in version control, then you
 .> can move forward with a (re)analysis without being crippled by out-of-date
 .> software.
 .>
 .> -Aaron
 .>
 .> --
 .> Aaron J. Mackey, PhD
 .> Assistant Professor
 .> Center for Public Health Genomics
 .> University of Virginia
 .> amackey at virginia.edu
 .> http://www.cphg.virginia.edu/mackey
 .>
 .> [[alternative HTML version deleted]]
 .>
 .> ______________________________________________
 .> R-devel at r-project.org mailing list
 .> https://stat.ethz.ch/mailman/listinfo/r-devel
 .
#
On 5 Mar 2013, at 14:36, Cook, Malcolm wrote:

Sounds a little like this:

http://cran.r-project.org/web/packages/rbundler/index.html

(which I haven't tested). Best,

Greg.

--
PLEASE NOTE CHANGE OF CONTACT DETAILS FROM MON 4TH MARCH:

Gregory Jefferis, PhD                   Tel: 01223 267048
Division of Neurobiology
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge, CB2 OQH, UK

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://jefferislab.org
http://flybrain.stanford.edu
#
On Tue, 5 Mar 2013, Cook, Malcolm wrote:
If you had a separate environment for every project, each with its own R 
installation and its own library (lib.loc), this becomes rather easy. For 
instance, something like this:

myProject/
    projectRInstallation/
       bin/
         R
       library/
         Biobase
         annotate
         .....
       ....
    projectData/
    projectCode/
    projectOutput/

The directory structure would likely be more complicated than that, but 
something along those lines. This way, all code, data *and* compute 
environment are always linked together.
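With a layout like this, a script in projectCode/ needs only to pin the session to the project's own library before anything else (paths illustrative):

```r
## First line of any analysis script: use the project's library,
## falling back to the site libraries for anything not found there.
.libPaths(c("myProject/projectRInstallation/library", .libPaths()))
```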

-J
#
.> So, even if I wanted to go where dragons lurked, it would not be
 .> possible to cobble a version of biocLite that installed specific
 .> versions of software.
 .>
 .> Thus, I might rather consider an approach that at 'publish' time
 .> tarzips up a copy of the R package dependencies based on a config file
 .> defined from sessionInfo and caches it in the project directory.
 .>
 .> Then when/if the project is revisited (and found to produce differnt
 .> results under current R enviRonment),  I can "simply" install an old R
 .> (oops, I guess I'd have to build it), and then un-tarzip the
 .> dependencies into the projects own R/Library which I would put on
 .> .libpaths.
 .
 .Sounds a little like this:
 .
 .http://cran.r-project.org/web/packages/rbundler/index.html
 .
 .(which I haven't tested). Best,
 .
 .Greg.

Looks interesting - thanks for the suggestion.

But, but.... my use case is one in which an analyst at my site depends upon the local library installation and only retrospectively, at some publishable event (like handing the results over to the in-house customer/scientist), seeks to ensure the ability to return to that exact R library environment later.  This tool, on the other hand, commits the user to keeping a project-specific "bundle" from the outset.  Another set of trade-offs.  I will have to synthesize the options I am learning.....

~ Malcolm
#
One comment: when trying to reproduce results from a year or two back, I have 
found numerical changes due to updates to the OS's compilers or runtime at 
least as often as I have found them due to changes in R or packages.  That 
aspect is rarely mentioned in these discussions.
On 05/03/2013 15:09, Cook, Malcolm wrote:

#
(More on the original question further below.)
On 13-03-05 09:48 AM, Cook, Malcolm wrote:
I have implemented a strategy to try to address this as follows:

1/ Install a new version of R when it is released, with packages in that R 
version's site-library at the package versions available at the time the 
R version is installed. Only upgrade these package versions in the case 
that they are severely broken.

2/ Install the same packages in site-library-fresh and upgrade these 
package versions on a regular basis (e.g. daily).

3/ When a new version of R is released, freeze but do not remove the old 
R version, at least not for a fairly long time, and freeze 
site-library-fresh for the old version. Begin with the new version as in 
1/ and 2/. The old version remains available, so "reverting" is trivial.
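Step 2/ is the only one that needs regular automation; a nightly cron job could run something like this against site-library-fresh, leaving the frozen site-library untouched (path and repos illustrative):

```r
## Refresh only the 'fresh' library; the frozen site-library is never
## passed to update.packages(), so it stays at install-time versions.
fresh <- "/usr/local/lib/R/site-library-fresh"
update.packages(lib.loc = fresh,
                repos   = "http://cran.r-project.org",
                ask     = FALSE, checkBuilt = TRUE)
```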


The analysts are then responsible for choosing the R version they use, 
and the library they use. This means they do not have to change R and 
package version mid-project, but they can if they wish. I think the 
above two libraries will cover most cases, but it is possible that a few 
projects will need their own special library with a combination of 
package versions. In this case the user could create their own library, 
or you might prefer some more official mechanism.

The idea of the above strategy is to provide the stability one might 
want for an ongoing project, and the possibility of an upgraded package 
if necessary, but not encourage analysts to remain indefinitely with old 
versions (by say, putting new packages in an old R version library).

This strategy has been implemented in a set of make files in the project 
RoboAdmin available at http://automater.r-forge.r-project.org/. It can 
be done entirely automatically with a cron job. Constructive comments 
are always appreciated.

(IT departments sometimes think that there should be only one version of 
everything available, which they test and approve. So the initial 
reaction to this approach could be negative. I think they have not 
really thought about the advantages. They usually cannot test/approve an 
upgrade without user input, and timing is often extremely complicated 
because of ongoing user needs. This strategy is simply shifting 
responsibility and timing to the users, or user departments, that can 
actually do the testing and approving.)

Regarding NFS mounts, this setup is relatively robust. There can be occasional 
problems, especially for users who have a habit of keeping an R session 
open for days at a time and using site-library-fresh packages. In my 
experience this did not happen often enough to worry about a "blackout 
period".

Regarding the original question, I would like to think it could be 
possible to keep enough information to reproduce the exact environment, 
but I think for potentially sensitive numerical problems that is 
optimistic. As others have pointed out, results can depend not only on R 
and package versions, configuration, OS versions, and library and 
compiler versions, but also on the underlying hardware. You might have 
some hope using something like an Amazon core instance. (BTW, this 
problem is not specific to R.)

It is true that restricting to a fixed computing environment at your 
institution may ease things somewhat, but if you occasionally upgrade 
hardware or the OS then you will probably lose reproducibility.

An alternative that I recommend is that you produce a set of tests that 
confirm the results of any important project. These can be conveniently 
put in the tests/ directory of an R package, which is then maintained 
locally, not on CRAN, and built/tested whenever a new R and packages are 
installed. (Tools for this are also available at the web site indicated 
above.) This approach means that you continue to reproduce the old 
results, or if not, discover differences/problems in the old or new 
version of R and/or packages that may be important to you. I have been 
successfully using a variant of this since about 1993, using R and 
package tests/ since they became available.
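Such a test might be as small as this, placed in the package's tests/ directory and run automatically by R CMD check (all names hypothetical):

```r
## tests/reproduce.R: rerun a key computation and compare it to a
## reference result shipped inside the package.
library(myProjectPkg)                      # hypothetical local package
got <- runMainAnalysis()                   # hypothetical analysis wrapper
ref <- readRDS(system.file("extdata", "reference-results.rds",
                           package = "myProjectPkg"))
stopifnot(isTRUE(all.equal(got, ref, tolerance = 1e-8)))
```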

Paul
#
Hi Paul,

You outline some great suggestions!

I just wanted to point out that, in this case:
On Tue, Mar 5, 2013 at 5:34 PM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
[snip]
if users have a habit of working like this, they could also create an
R-library directory under their home directory, and put this library
path at the front of their .libPaths() so the continually updated
"fresh" stuff won't affect them.

I really like the general approach you've outlined; I just wanted to
point out that there's an easy workaround, in case someone else tries
to institute such a regime but gets friction on that point in
particular.

Good stuff, though .. thanks for sharing that!

-steve
#
Paul,

I think your balanced and reasoned approach addresses all my current concerns.  Nice!  I will likely adopt your methods.  Let me ruminate.  Thanks for this.

~ Malcolm

#
There are utilities (e.g. dotkit and environment modules) that facilitate version management, basically creating PATH and environment setups on the fly, if you are comfortable keeping all that around.

David

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
Sent: Tuesday, March 05, 2013 6:08 PM
To: 'Paul Gilbert'
Cc: 'r-devel at r-project.org'; 'bioconductor at r-project.org'; 'r-discussion at listserv.stowers.org'
Subject: Re: [BioC] [Rd] enabling reproducible research & R package management & install.package.version & BiocLite

Paul,

I think your balanced and reasoned approach addresses all my current concerns.  Nice!  I will likely adopt your methods.  Let me ruminate.  Thanks for this.

~ Malcolm

 .-----Original Message-----
 .From: Paul Gilbert [mailto:pgilbert902 at gmail.com]
 .Sent: Tuesday, March 05, 2013 4:34 PM
 .To: Cook, Malcolm
 .Cc: 'r-devel at r-project.org'; 'bioconductor at r-project.org'; 'r-discussion at listserv.stowers.org'
 .Subject: Re: [Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite  .
 .(More on the original question further below.)  .
.On 13-03-05 09:48 AM, Cook, Malcolm wrote:
.> All,
 .>
 .> What got me started on this line of inquiry was my attempt at  .> balancing the advantages of performing a periodic (daily or weekly)  .> update to the 'release' version of locally installed R/Bioconductor  .> packages on our institute-wide installation of R with the  .> disadvantages of potentially changing the result of an analyst's  .> workflow in mid-project.
 .
 .I have implemented a strategy to try to address this as follows:
 .
 .1/ Install a new version of R when it is released, and packages in the R  .version's site-library with package versions as available at the time  .the R version is installed. Only upgrade these package versions in the  .case they are severely broken.
 .
 .2/ Install the same packages in site-library-fresh and upgrade these  .package versions on a regular basis (e.g. daily).
 .
 .3/ When a new version of R is released, freeze but do not remove the old  .R version, at least not for a fairly long time, and freeze  .site-library-fresh for the old version. Begin with the new version as in  .1/ and 2/. The old version remains available, so "reverting" is trivial.
 .
 .
 .The analysts are then responsible for choosing the R version they use,  .and the library they use. This means they do not have to change R and  .package version mid-project, but they can if they wish. I think the  .above two libraries will cover most cases, but it is possible that a few  .projects will need their own special library with a combination of  .package versions. In this case the user could create their own library,  .or you might prefer some more official mechanism.
>
> The idea of the above strategy is to provide the stability one might want for an ongoing project, and the possibility of an upgraded package if necessary, but not to encourage analysts to remain indefinitely with old versions (by, say, putting new packages in an old R version's library).
>
> This strategy has been implemented in a set of make files in the project RoboAdmin, available at http://automater.r-forge.r-project.org/. It can be run entirely automatically with a cron job. Constructive comments are always appreciated.
>
> (IT departments sometimes think that there should be only one version of everything available, which they test and approve. So the initial reaction to this approach could be negative. I think they have not really thought about the advantages. They usually cannot test/approve an upgrade without user input, and timing is often extremely complicated because of ongoing user needs. This strategy simply shifts responsibility and timing to the users, or user departments, that can actually do the testing and approving.)
>
> Regarding NFS mounts, it is relatively robust. There can be occasional problems, especially for users who have a habit of keeping an R session open for days at a time while using site-library-fresh packages. In my experience this did not happen often enough to worry about a "blackout period".
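[Editor's note: one way to reduce the open-file exposure during library updates, sketched here as an editor's suggestion rather than something from the thread, assuming the site library is reached through a symlink: install the new tree beside the old one and repoint the link with an atomic rename, so sessions that already resolved the old path are undisturbed.]

```r
## Sketch: dated library trees behind a "current" symlink. All paths
## are illustrative; install.packages(..., lib = new_lib) would
## populate the new tree before the swap.
root    <- file.path(tempdir(), "Rlibs")
old_lib <- file.path(root, "site-library-20130305")
new_lib <- file.path(root, "site-library-20130306")
for (d in c(old_lib, new_lib))
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
current <- file.path(root, "current")
file.symlink(old_lib, current)   # initial state: current -> old tree
## swap: make the new link under a temporary name, then rename it over
## "current" -- rename() is atomic within one POSIX filesystem
tmp <- file.path(root, "current-new")
file.symlink(new_lib, tmp)
file.rename(tmp, current)
Sys.readlink(current)            # now points at the new tree
```

Old sessions keep the directory they already opened; new sessions pick up the new tree, with no blackout window in between.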
>
> Regarding the original question, I would like to think it could be possible to keep enough information to reproduce the exact environment, but I think for potentially sensitive numerical problems that is optimistic. As others have pointed out, results can depend not only on R and package versions, but also on configuration, OS versions, library and compiler versions, and the underlying hardware. You might have some hope using something like an Amazon core instance. (BTW, this problem is not specific to R.)
>
> It is true that restricting to a fixed computing environment at your institution may ease things somewhat, but if you occasionally upgrade hardware or the OS then you will probably lose reproducibility.
>
> An alternative that I recommend is that you produce a set of tests that confirm the results of any important project. These can be conveniently put in the tests/ directory of an R package, which is then maintained locally, not on CRAN, and built/tested whenever a new R and packages are installed. (Tools for this are also available at the above indicated web site.) This approach means that you continue to reproduce the old results, or, if not, discover differences/problems in the old or new version of R and/or packages that may be important to you. I have been successfully using a variant of this since about 1993, using R and package tests/ since they became available.
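[Editor's note: the tests/ idea can be as simple as a script that re-runs a key computation and stops if the saved reference value is no longer reproduced. A hypothetical tests/regression.R; the computation, reference value, and tolerance are placeholders.]

```r
## Re-run a (stand-in) analysis step and compare against the value
## recorded when the project was finalized. In a real package the
## reference might come from something like readRDS("reference.rds").
result    <- sum(1 / (1:100)^2)  # placeholder for a real analysis step
reference <- 1.6349839           # value recorded at finalization time
if (abs(result - reference) > 1e-6)
  stop("regression: result ", result, " differs from reference ", reference)
cat("reference result reproduced\n")
```

Because R CMD check runs everything under tests/, rebuilding this package against each new R/package installation re-validates the old results automatically.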
>
> Paul
>
>> I just got the "green light" to institute such periodic updates, which I have been arguing are in our collective best interest. In return, I promised my best effort to provide a means for preserving or reverting to a working R library configuration.
>>
>> Please note that the reproducibility I am most eager to provide is limited to reproducibility within the computing environment of our institute, which perhaps takes away some of the dragon's nests, though certainly not all.
>>
>> There are technical issues of updating package installations on an NFS mount that might have files/libraries open on it from running R sessions. I am interested in learning of approaches for minimizing/eliminating exposure to these issues as well. The first/best approach seems to be to institute a 'blackout' period when users should expect the installed library to change. Perhaps there are improvements to this?
>>
>> Best,
>>
>> Malcolm
>>
>> -----Original Message-----
>> From: Mike Marchywka [mailto:marchywka at hotmail.com]
>> Sent: Tuesday, March 05, 2013 5:24 AM
>> To: amackey at virginia.edu; Cook, Malcolm
>> Cc: r-devel at r-project.org; bioconductor at r-project.org; r-discussion at listserv.stowers.org
>> Subject: RE: [Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
>>
>> I hate to ask what got this thread started, but it sounds like someone was counting on exact numeric reproducibility, or was there a bug in a specific release? In actual fact, the best way to determine reproducibility is to run the code in a variety of packages. Alternatively, you can do everything in Java and not assume that calculations commute or associate as the code is modified, but it seems pointless. Sensitivity determination would seem to lead to more reproducible results than trying to keep a specific set of code quirks.
>>
>> I also seem to recall that the FPU may produce random lower-order bits in some cases: the same code and data give different results. Always assume FP is stochastic and plan on analyzing the "noise."
>>
>> ----------------------------------------
>>> From: amackey at virginia.edu
>>> Date: Mon, 4 Mar 2013 16:28:48 -0500
>>> To: MEC at stowers.org
>>> CC: r-devel at r-project.org; bioconductor at r-project.org; r-discussion at listserv.stowers.org
>>> Subject: Re: [Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
>>>
>>> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>>>
>>>> * where do the dragons lurk
>>>
>>> Webs of interconnected dynamically loaded libraries, identical versions of R compiled with different BLAS/LAPACK options, etc. Go with the VM if you really, truly, want this level of exact reproducibility.
>>>
>>> An alternative (and arguably more useful) strategy would be to cache results of each computational step, and report when results differ upon re-execution with identical inputs; if you cache sessionInfo along with each result, you can identify which package(s) changed, and begin to hunt down why the change occurred (possibly for the better). Couple this with the concept of keeping both code *and* results in version control, and you can move forward with a (re)analysis without being crippled by out-of-date software.
>>>
>>> -Aaron
>>>
>>> --
>>> Aaron J. Mackey, PhD
>>> Assistant Professor
>>> Center for Public Health Genomics
>>> University of Virginia
>>> amackey at virginia.edu
>>> http://www.cphg.virginia.edu/mackey
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
#
Thanks David, I've looked into them both a bit, and I don't think they provide an approach to R (or Perl, for that matter) library management, which is the wicket I'm trying to make less sticky now.

They could be useful for managing the various installed versions of R and analysis tools (we're talking a lot of NextGen sequencing, so bowtie, tophat, and friends) quite nicely, similarly in service of an approach to enabling reproducible results.

Thanks for your thoughts, and if you know of others similar to dotkit/modules I'd be keen to hear of them.

~Malcolm
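[Editor's note: the on-the-fly PATH/env setup that dotkit and modules automate amounts to pointing the environment at one R installation and its matching library. A hand-rolled sketch of the equivalent in R itself; "/opt/R/2.15.3" is an illustrative prefix, not a real path.]

```r
## What a modulefile would do for an R setup, done by hand: select an
## installation prefix and expose its bin/ and site-library via the
## environment (in practice this happens before R is launched).
r_prefix <- "/opt/R/2.15.3"
Sys.setenv(
  R_LIBS_SITE = file.path(r_prefix, "site-library"),
  PATH        = paste(file.path(r_prefix, "bin"),
                      Sys.getenv("PATH"),
                      sep = .Platform$path.sep)
)
Sys.getenv("R_LIBS_SITE")  # "/opt/R/2.15.3/site-library"
```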


> -----Original Message-----
> From: Lapointe, David [mailto:David.Lapointe at umassmed.edu]
> Sent: Wednesday, March 06, 2013 7:46 AM
> To: Cook, Malcolm; 'Paul Gilbert'
> Cc: 'r-devel at r-project.org'; 'bioconductor at r-project.org'; 'r-discussion at listserv.stowers.org'
> Subject: RE: [BioC] [Rd] enabling reproducible research & R package management & install.package.version & BiocLite
>
> There are utilities (e.g. dotkit and modules) which facilitate version management, basically creating on-the-fly PATH and env setups, if you are comfortable keeping all that around.
>
> David
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Cook, Malcolm
> Sent: Tuesday, March 05, 2013 6:08 PM
> To: 'Paul Gilbert'
> Cc: 'r-devel at r-project.org'; 'bioconductor at r-project.org'; 'r-discussion at listserv.stowers.org'
> Subject: Re: [BioC] [Rd] enabling reproducible research & R package management & install.package.version & BiocLite
>
> Paul,
>
> I think your balanced and reasoned approach addresses all my current concerns. Nice! I will likely adopt your methods. Let me ruminate. Thanks for this.
>
> ~ Malcolm