The case for freezing CRAN

There is a central assertion to this argument that I don't follow:
At the end of the day most published results obtained with R just won't be reproducible.
This is a very strong assertion. What is the evidence for it?

  I write a lot of Sweave/knitr in house as a way of documenting complex analyses, and a 
glm() based logistic regression looks the same yesterday as it will tomorrow.

Terry Therneau

There is a central assertion to this argument that I don't follow:

At the end of the day most published results obtained with R just won't be reproducible.
This is a very strong assertion. What is the evidence for it?
If I've understood Jeroen correctly, his point might be alternatively phrased as "won't be reproducED" (i.e., end user difficulties, not software availability).

Michael
On Mar 20, 2014, at 8:19, "Therneau, Terry M., Ph.D." <therneau at mayo.edu> wrote:

There is a central assertion to this argument that I don't follow:

At the end of the day most published results obtained with R just won't be reproducible.
This is a very strong assertion. What is the evidence for it?
If I've understood Jeroen correctly, his point might be alternatively phrased as "won't be reproducED" (i.e., end user difficulties, not software availability).

Michael

That was my point as well.  Of the 30+ Sweave documents that I've produced I can't think 
of one that will change its output with a new version of R.  My 0/30 estimate is at odds 
with the "nearly all" assertion.  Perhaps I only do dull things?

Terry T.

On 03/20/2014 07:48 AM, Michael Weylandt wrote:
On Mar 20, 2014, at 8:19, "Therneau, Terry M., Ph.D." 
<therneau at mayo.edu> wrote:

There is a central assertion to this argument that I don't follow:

At the end of the day most published results obtained with R just 
won't be reproducible.
This is a very strong assertion. What is the evidence for it?
If I've understood Jeroen correctly, his point might be alternatively 
phrased as "won't be reproducED" (i.e., end user difficulties, not 
software availability).

Michael

That was my point as well.  Of the 30+ Sweave documents that I've 
produced I can't think of one that will change its output with a new 
version of R.  My 0/30 estimate is at odds with the "nearly all" 
assertion.  Perhaps I only do dull things?

Terry T.

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
The only concrete example that comes to mind from my own Sweave reports 
was actually caused by BioConductor and not CRAN. I had a set of 
analyses that used DNAcopy, and the results changed substantially with a 
new release of the package in which they changed the default values to 
the main function call.   As a result, I've taken to writing out more of 
the defaults that I previously just accepted.  There have been a few 
minor issues similar to this one (with changes to parts of the Mclust 
package ??). So my estimates are somewhat higher than 0/30 but are still 
a long way from "almost all".

Kevin
No attempt to summarize the thread, but a few highlighted points:

 o Karl's suggestion of versioned / dated access to the repo by adding a
   layer to webaccess is (as usual) nice.  It works on the 'supply' side. But
   Jeroen's problem is on the demand side.  Even when we know that an
   analysis was done on 20xx-yy-zz, and we reconstruct CRAN that day, it only
   gives us a 'ceiling' estimate of what was on the machine.  In production
   or lab environments, installations get stale.  Maybe packages were already
   a year old?  To me, this is an issue that needs to be addressed on the
   'demand' side of the user. But just writing out version numbers is not
   good enough.

 o Roger correctly notes that R scripts and packages are just one issue.
   Compilers, libraries and the OS matter.  To me, the natural approach these
   days would be to think of something based on Docker or Vagrant or (if you
   must, VirtualBox).  The newer alternatives make snapshotting very cheap
   (eg by using Linux LXC).  That approach reproduces a full environemnt as
   best as we can while still ignoring the hardware layer (and some readers
   may recall the infamous Pentium bug of two decades ago).

 o Reproduciblity will probably remain the responsibility of study
   authors. If an investigator on a mega-grant wants to (or needs to) freeze,
   they do have the tools now.  Requiring the need of a few to push work on
   those already overloaded (ie CRAN) and changing the workflow of everybody
   is a non-starter.

 o As Terry noted, Jeroen made some strong claims about exactly how flawed
   the existing system is and keeps coming back to the example of 'a JSS
   paper that cannot be re-run'.  I would really like to see empirics on
   this.  Studies of reproducibility appear to be publishable these days, so
   maybe some enterprising grad student wants to run with the idea of
   actually _testing_ this.  We maybe be above Terry's 0/30 and nearer to
   Kevin's 'low'/30.  But let's bring some data to the debate.

 o Overall, I would tend to think that our CRAN standards of releasing with
   tests, examples, and checks on every build and release already do a much
   better job of keeping things tidy and workable than in most if not all
   other related / similar open source projects. I would of course welcome
   contradictory examples.

Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
[snip]
     (and some readers
   may recall the infamous Pentium bug of two decades ago).
It was a "Flaw" not a "Bug".  At least I remember the Intel people
making a big deal about that distinction.

But I do remember the time well, I was a biostatistics Ph.D. student
at the time and bought one of the flawed pentiums.  My attempts at
getting the chip replaced resulted in a major run around and each
person that I talked to would first try to explain that I really did
not need the fix because the only people likely to be affected were
large corporations and research scientists.  I will admit that I was
not a large corporation, but if a Ph.D. student in biostatistics is
not a research scientist, then I did not know what they defined one
as.  When I pointed this out they would usually then say that it still
would not matter, unless I did a few thousand floating point
operations I was unlikely to encounter one of the problematic
divisions.  I would then point out that some days I did over 10,000
floating point operations before breakfast (I had checked after the
1st person told me this and 10,000 was a low estimate of a lower bound
of one set of simulations) at which point they would admit that I had
a case and then send me to talk to someone else who would start the
process over.

[snip]
--
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140320/13e19c44/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140320/891e5b92/attachment.pl>

On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
[snip]

    (and some readers
  may recall the infamous Pentium bug of two decades ago).
It was a "Flaw" not a "Bug".  At least I remember the Intel people
making a big deal about that distinction.

But I do remember the time well, I was a biostatistics Ph.D. student
at the time and bought one of the flawed pentiums.  My attempts at
getting the chip replaced resulted in a major run around and each
person that I talked to would first try to explain that I really did
not need the fix because the only people likely to be affected were
large corporations and research scientists.  I will admit that I was
not a large corporation, but if a Ph.D. student in biostatistics is
not a research scientist, then I did not know what they defined one
as.  When I pointed this out they would usually then say that it still
would not matter, unless I did a few thousand floating point
operations I was unlikely to encounter one of the problematic
divisions.  I would then point out that some days I did over 10,000
floating point operations before breakfast (I had checked after the
1st person told me this and 10,000 was a low estimate of a lower bound
of one set of simulations) at which point they would admit that I had
a case and then send me to talk to someone else who would start the
process over.
Further segue:

That (1994) was a watershed moment for Intel as a company. A time during which Intel's future was quite literally at stake. Intel's internal response to that debacle, which fundamentally altered their own perception of just who their customer was (the OEM's like IBM, COMPAQ and Dell versus the end users like us), took time to be realized, as the impact of increasingly negative PR took hold. It was also a good example of the impact of public perception (a flawed product) versus the realities of how infrequently the flaw would be observed in "typical" computing. "Perception is reality", as some would observe.

Intel ultimately spent somewhere in the neighborhood of $500 million (in 1994 U.S. dollars), as I recall, to implement a large scale Pentium chip replacement infrastructure targeted to end users. The "Intel Inside" marketing campaign was also an outgrowth of that time period.

Regards,

Marc Schwartz
[snip]
--
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

On Mar 20, 2014, at 12:23 PM, Greg Snow <538280 at gmail.com> wrote:

On Thu, Mar 20, 2014 at 7:32 AM, Dirk Eddelbuettel <edd at debian.org> wrote:
[snip]

   (and some readers
 may recall the infamous Pentium bug of two decades ago).
It was a "Flaw" not a "Bug".  At least I remember the Intel people
making a big deal about that distinction.

But I do remember the time well, I was a biostatistics Ph.D. student
at the time and bought one of the flawed pentiums.  My attempts at
getting the chip replaced resulted in a major run around and each
person that I talked to would first try to explain that I really did
not need the fix because the only people likely to be affected were
large corporations and research scientists.  I will admit that I was
not a large corporation, but if a Ph.D. student in biostatistics is
not a research scientist, then I did not know what they defined one
as.  When I pointed this out they would usually then say that it still
would not matter, unless I did a few thousand floating point
operations I was unlikely to encounter one of the problematic
divisions.  I would then point out that some days I did over 10,000
floating point operations before breakfast (I had checked after the
1st person told me this and 10,000 was a low estimate of a lower bound
of one set of simulations) at which point they would admit that I had
a case and then send me to talk to someone else who would start the
process over.

Further segue:

That (1994) was a watershed moment for Intel as a company. A time during which Intel's future was quite literally at stake. Intel's internal response to that debacle, which fundamentally altered their own perception of just who their customer was (the OEM's like IBM, COMPAQ and Dell versus the end users like us), took time to be realized, as the impact of increasingly negative PR took hold. It was also a good example of the impact of public perception (a flawed product) versus the realities of how infrequently the flaw would be observed in "typical" computing. "Perception is reality", as some would observe.

Intel ultimately spent somewhere in the neighborhood of $500 million (in 1994 U.S. dollars), as I recall, to implement a large scale Pentium chip replacement infrastructure targeted to end users. The "Intel Inside" marketing campaign was also an outgrowth of that time period.

Quick correction, thanks to Peter, on my assertion that the "Intel Inside" campaign arose from the 1994 Pentium issue. It actually started in 1991.

I had a faulty recollection from my long ago reading of Andy Grove's 1996 book, "Only The Paranoid Survive", that the slogan arose from Intel's reaction to the Pentium fiasco. It actually pre-dated that time frame by a few years.

Thanks Peter!

Regards,

Marc
Dirk Eddelbuettel <edd at debian.org> writes:
 o Roger correctly notes that R scripts and packages are just one issue.
   Compilers, libraries and the OS matter.  To me, the natural approach these
   days would be to think of something based on Docker or Vagrant or (if you
   must, VirtualBox).  The newer alternatives make snapshotting very cheap
   (eg by using Linux LXC).  That approach reproduces a full environemnt as
   best as we can while still ignoring the hardware layer (and some readers
   may recall the infamous Pentium bug of two decades ago).
These two tools look very interesting - but I have, even after reading a
few discussions of their differences, no idea which one is better suited
to be used for what has been discussed here: Making it possible to run
the analysis later to reproduce results using the same versions used in
the initial analysis.

Am I right in saying:

- Vagrant uses VMs to emulate the hardware
- Docker does not

wherefore
- Vagrant is slower and requires more space
- Docker is faster and requires less space

Therefore, could one say that Vagrant is more "robust" in the long run?

How do they compare in relation to different platforms? Vagrant seems to
be platform agnostic, I can develop and run on Linux, Mac and Windows -
how does it work with Docker? 

I just followed [1] and setup Docker on OSX - loos promising - it also
uses an underlying VM. SO both should be equal in regards to
reproducability in the long run?

Please note: I see these questions in the light of this discussion of
reproducability and not in regards to deployment of applications what
the discussions on the web are.

Any comments, thoughts, remarks?

Rainer

Footnotes: 
[1]  http://docs.docker.io/en/latest/installation/mac/
Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 494 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140321/09332430/attachment.bin>
..............................................<?}))><........
 ) ) ) ) )
( ( ( ( (    Prof. Philippe Grosjean
 ) ) ) ) )
( ( ( ( (    Numerical Ecology of Aquatic Systems
 ) ) ) ) )   Mons University, Belgium
( ( ( ( (
..............................................................

Dirk Eddelbuettel <edd at debian.org> writes:

o Roger correctly notes that R scripts and packages are just one issue.
  Compilers, libraries and the OS matter.  To me, the natural approach these
  days would be to think of something based on Docker or Vagrant or (if you
  must, VirtualBox).  The newer alternatives make snapshotting very cheap
  (eg by using Linux LXC).  That approach reproduces a full environemnt as
  best as we can while still ignoring the hardware layer (and some readers
  may recall the infamous Pentium bug of two decades ago).
These two tools look very interesting - but I have, even after reading a
few discussions of their differences, no idea which one is better suited
to be used for what has been discussed here: Making it possible to run
the analysis later to reproduce results using the same versions used in
the initial analysis.

Am I right in saying:

- Vagrant uses VMs to emulate the hardware
- Docker does not

Yes.
wherefore
- Vagrant is slower and requires more space
- Docker is faster and requires less space

It depends. For instance, if you run R in VirtualBox under Windows, it may run faster depending on the code you run and, say, the Lapack library used. On Linux, you typically got R code run in the VM 2-3% slower than natively, but In a Windows host, most of my R code runs faster in the VM? But yes, you need more RAM.

With Vagrant, you do not need to keep you VM once you don't use it any more. Then, disk space is shrunk down to a few kB, corresponding to the Vagrant configuration file. I guess the same is true for Docker?

A big advantage of Vagrant + VirtualBox is that you got a very similar virtual hardware, no matter if your host system is Linux, Windows or Mac OS X. I see this as a good point for better reproducibility.
Therefore, could one say that Vagrant is more "robust" in the long run?

May be,? but it depends almost entirely how VirtualBox will support old VMs in the future!

PhG
How do they compare in relation to different platforms? Vagrant seems to
be platform agnostic, I can develop and run on Linux, Mac and Windows -
how does it work with Docker? 

I just followed [1] and setup Docker on OSX - loos promising - it also
uses an underlying VM. SO both should be equal in regards to
reproducability in the long run?

Please note: I see these questions in the light of this discussion of
reproducability and not in regards to deployment of applications what
the discussions on the web are.

Any comments, thoughts, remarks?

Rainer

Footnotes: 
[1]  http://docs.docker.io/en/latest/installation/mac/

-- 
Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140321/dff39167/attachment.pl>
G?bor Cs?rdi <csardi.gabor at gmail.com> writes:
You might want to look at packer as well, which can build virtual machines
from an ISO, without any user intaraction. I successfully used it to build
VMs with Linux, OSX and Windows. It can also create vagrant boxes. You can
specify provisioners, e.g. to install R, or a set of R packages, etc. It is
under heavy development, by the same team as vagrant.
I think I am getting lost in these - I looked ad Docker, and it looks
promising, but I actually didn't even manage to sh into the running
container. Is there somewhere an howto on how one can use these in R, to
the purpose discussed in this thread? If not, I really think this would
be needed. It is extremely difficult for me to translate what I want to
do into the deployment / management / development scenarios discussed in
the blogs I have found.

Cheers, 

(a confused)
Rainer
Gabor

On Fri, Mar 21, 2014 at 9:03 AM, Philippe GROSJEAN <
Philippe.GROSJEAN at umons.ac.be> wrote:

..............................................<}))><........
 ) ) ) ) )
( ( ( ( (    Prof. Philippe Grosjean
 ) ) ) ) )
( ( ( ( (    Numerical Ecology of Aquatic Systems
 ) ) ) ) )   Mons University, Belgium
( ( ( ( (
..............................................................

On 21 Mar 2014, at 10:59, Rainer M Krug <Rainer at krugs.de> wrote:

Dirk Eddelbuettel <edd at debian.org> writes:

o Roger correctly notes that R scripts and packages are just one issue.
  Compilers, libraries and the OS matter.  To me, the natural approach
these
  days would be to think of something based on Docker or Vagrant or (if
you
  must, VirtualBox).  The newer alternatives make snapshotting very
cheap
  (eg by using Linux LXC).  That approach reproduces a full environemnt
as
  best as we can while still ignoring the hardware layer (and some
readers
  may recall the infamous Pentium bug of two decades ago).
These two tools look very interesting - but I have, even after reading a
few discussions of their differences, no idea which one is better suited
to be used for what has been discussed here: Making it possible to run
the analysis later to reproduce results using the same versions used in
the initial analysis.

Am I right in saying:

- Vagrant uses VMs to emulate the hardware
- Docker does not

Yes.

wherefore
- Vagrant is slower and requires more space
- Docker is faster and requires less space

It depends. For instance, if you run R in VirtualBox under Windows, it may
run faster depending on the code you run and, say, the Lapack library used.
On Linux, you typically got R code run in the VM 2-3% slower than natively,
but In a Windows host, most of my R code runs faster in the VM... But yes,
you need more RAM.

With Vagrant, you do not need to keep you VM once you don't use it any
more. Then, disk space is shrunk down to a few kB, corresponding to the
Vagrant configuration file. I guess the same is true for Docker?

A big advantage of Vagrant + VirtualBox is that you got a very similar
virtual hardware, no matter if your host system is Linux, Windows or Mac OS
X. I see this as a good point for better reproducibility.

Therefore, could one say that Vagrant is more "robust" in the long run?

May be,... but it depends almost entirely how VirtualBox will support old
VMs in the future!

PhG

How do they compare in relation to different platforms? Vagrant seems to
be platform agnostic, I can develop and run on Linux, Mac and Windows -
how does it work with Docker?

I just followed [1] and setup Docker on OSX - loos promising - it also
uses an underlying VM. SO both should be equal in regards to
reproducability in the long run?

Please note: I see these questions in the light of this discussion of
reproducability and not in regards to deployment of applications what
the discussions on the web are.

Any comments, thoughts, remarks?

Rainer

Footnotes:
[1]  http://docs.docker.io/en/latest/installation/mac/

--
Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

	[[alternative HTML version deleted]]

Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 494 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140321/64e36fe4/attachment.bin>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140321/87f960b4/attachment.pl>

On Fri, Mar 21, 2014 at 11:12 AM, Rainer M Krug <Rainer at krugs.de> wrote:

G?bor Cs?rdi <csardi.gabor at gmail.com> writes:

You might want to look at packer as well, which can build virtual
machines
from an ISO, without any user intaraction. I successfully used it to
build
VMs with Linux, OSX and Windows. It can also create vagrant boxes. You
can
specify provisioners, e.g. to install R, or a set of R packages, etc. It
is
under heavy development, by the same team as vagrant.
I think I am getting lost in these - I looked ad Docker, and it looks
promising, but I actually didn't even manage to sh into the running
container. Is there somewhere an howto on how one can use these in R, to
the purpose discussed in this thread? If not, I really think this would
be needed. It is extremely difficult for me to translate what I want to
do into the deployment / management / development scenarios discussed in
the blogs I have found.

I haven't tried Docker, so I cannot say anything about that. The purpose of
vagrant and packer is slightly different, but there seems to be some
overlap.

Packer helps you building a virtual machine from an ISO, automatically,
without any human interaction. That's pretty much it. The result can be a
VirtualBox, VMWare, etc. virtual machine, or even a vagrant box. I used it
to build Ubuntu, OSX and Windows boxes, it works great if you have a
working configuration. If you need to tweak a config to install additional
software, etc. then it requires some experimenting and patience, because
debugging is not that great.

Vagrant manages disposable virtual machines. I.e. it takes a vagrant box,
which is essentially a VM and some extra configuration info, provisions it,
which usually means installing software or setting up a development
environment, and then manages it, so that you can ssh to it, or do whatever
you want with it.

There are a number of boxes available, so if you want a minimal VM with
Ubuntu32, it takes one command to create it from a public box, another one
starting it, and a third one to ssh to it. It is literally a couple of
minutes, downloading the box takes longest. If you have the box, then it is
even quicker.

You can use packer and vagrant together. Packer creates the vagrant box,
sets up a very minimal environment. Then you can use vagrant with this box.

In my opinion it is somewhat cumbersome to use this for everyday work,
although good virtualization software definitely helps.

Gabor

Additional info: you access R into the VM from within the host by ssh. You can enable x11 forwarding there and you also got GUI stuff. It works like a charm, but there are still some problems on my side when I try to disconnect and reconnect to the same R process. I can solve this with, say, screen. However, if any X11 window is displayed while I disconnect, R crashes immediately on reconnection.
Best,

PhG
	[[alternative HTML version deleted]]

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140322/c9bf6b76/attachment.pl>
On 22 Mar 2014 12:38, "Philippe GROSJEAN" <Philippe.GROSJEAN at umons.ac.be>
wrote:
On 21 Mar 2014, at 20:21, G??bor Cs??rdi <csardi.gabor at gmail.com> wrote:
In my opinion it is somewhat cumbersome to use this for everyday work,
although good virtualization software definitely helps.

Gabor

Additional info: you access R into the VM from within the host by ssh.
You can enable x11 forwarding there and you also got GUI stuff. It works
like a charm, but there are still some problems on my side when I try to
disconnect and reconnect to the same R process. I can solve this with, say,
screen. However, if any X11 window is displayed while I disconnect, R
crashes immediately on reconnection.

You might find the program 'xpra' useful. It's like screen, but for x11
programs.

-n
I second that. However, by default, xpra and GNU Screen are not aware of 
each other. To connect to xpra from within GNU Screen, you usually need 
to set the DISPLAY environment variable manually. I have described a 
solution that automates this, so that GUI applications "just work" from 
within GNU Screen and also survive a disconnect: 
http://krlmlr.github.io/integrating-xpra-with-screen/ .

-Kirill
| You might find the program 'xpra' useful. It's like screen, but for x11
| programs.

There are also NXserver/NXclien/FreeNX which keep 'x11 / xdm sessions' and
can resume / reconnect when the client dies.  I find x2go quite useful at work.

Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
Thanks everybody for their input - interesting suggestions and useful
information - thanks.

But I am still struggling to use this information. What I got so far:

1) I have decided to try docker [1]
2) Installed docker and boot2docker on a Mac via homebrew and it works
3) I found some Dockerfiles to create an image with R and ssh
4) The dockerfile runs and creates the image
5) I can interactively connect to the image by using bash and R is
running there
5) As I am using emacs /  ess, I want to use ssh do R stuff (other
suggestions welcome)

Problems:
1) I don't manage to connect to the running docker image following [2] -
I even managed to freeze my computer while trying.
2) Even if I could, I understand that the ssh port would be different each
time - not very nice. Is there a way of setting the port?

Questions:

1) Am I right in saying, that I have to use ssh to access the running
image, or is there a (faster?) alternative? I mean - I am working
locally and I don't need any encryption.

2) Would Vagrant make the process easier?

And finally:

I think it would be great if this information could be collected in a
wiki page, as I did not find anything about the usage scenario of docker
/ vagrant discussed here - I will certainly see that I blog about my
tries.

Cheers,

Rainer

Kirill M?ller <kirill.mueller at ivt.baug.ethz.ch> writes:
On 03/22/2014 02:10 PM, Nathaniel Smith wrote:
On 22 Mar 2014 12:38, "Philippe GROSJEAN" <Philippe.GROSJEAN at umons.ac.be>
wrote:
On 21 Mar 2014, at 20:21, G??bor Cs??rdi <csardi.gabor at gmail.com> wrote:
In my opinion it is somewhat cumbersome to use this for everyday work,
although good virtualization software definitely helps.

Gabor

Additional info: you access R into the VM from within the host by ssh.
You can enable x11 forwarding there and you also got GUI stuff. It works
like a charm, but there are still some problems on my side when I try to
disconnect and reconnect to the same R process. I can solve this with, say,
screen. However, if any X11 window is displayed while I disconnect, R
crashes immediately on reconnection.

You might find the program 'xpra' useful. It's like screen, but for x11
programs.

-n
I second that. However, by default, xpra and GNU Screen are not aware
of each other. To connect to xpra from within GNU Screen, you usually
need to set the DISPLAY environment variable manually. I have
described a solution that automates this, so that GUI applications
"just work" from within GNU Screen and also survive a disconnect:
http://krlmlr.github.io/integrating-xpra-with-screen/ .

-Kirill

Footnotes: 
[1]  https://www.docker.io

[2]  http://docs.docker.io/en/latest/examples/running_ssh_service/
Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 494 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140324/8a531819/attachment.bin>
Forgot: My Dockerfiloe is on github:

https://github.com/rkrug/R-docker

Rainer M Krug <Rainer at krugs.de> writes:
Thanks everybody for their input - interesting suggestions and useful
information - thanks.

But I am still struggling to use this information. What I got so far:

1) I have decided to try docker [1]
2) Installed docker and boot2docker on a Mac via homebrew and it works
3) I found some Dockerfiles to create an image with R and ssh
4) The dockerfile runs and creates the image
5) I can interactively connect to the image by using bash and R is
running there
5) As I am using emacs /  ess, I want to use ssh do R stuff (other
suggestions welcome)

Problems:
1) I don't manage to connect to the running docker image following [2] -
I even managed to freeze my computer while trying.
2) Even if I could, I understand that the ssh port would be different each
time - not very nice. Is there a way of setting the port?

Questions:

1) Am I right in saying, that I have to use ssh to access the running
image, or is there a (faster?) alternative? I mean - I am working
locally and I don't need any encryption.

2) Would Vagrant make the process easier?

And finally:

I think it would be great if this information could be collected in a
wiki page, as I did not find anything about the usage scenario of docker
/ vagrant discussed here - I will certainly see that I blog about my
tries.

Cheers,

Rainer

Kirill M?ller <kirill.mueller at ivt.baug.ethz.ch> writes:

On 03/22/2014 02:10 PM, Nathaniel Smith wrote:
On 22 Mar 2014 12:38, "Philippe GROSJEAN" <Philippe.GROSJEAN at umons.ac.be>
wrote:
On 21 Mar 2014, at 20:21, G??bor Cs??rdi <csardi.gabor at gmail.com> wrote:
In my opinion it is somewhat cumbersome to use this for everyday work,
although good virtualization software definitely helps.

Gabor

Additional info: you access R into the VM from within the host by ssh.
You can enable x11 forwarding there and you also got GUI stuff. It works
like a charm, but there are still some problems on my side when I try to
disconnect and reconnect to the same R process. I can solve this with, say,
screen. However, if any X11 window is displayed while I disconnect, R
crashes immediately on reconnection.

You might find the program 'xpra' useful. It's like screen, but for x11
programs.

-n
I second that. However, by default, xpra and GNU Screen are not aware
of each other. To connect to xpra from within GNU Screen, you usually
need to set the DISPLAY environment variable manually. I have
described a solution that automates this, so that GUI applications
"just work" from within GNU Screen and also survive a disconnect:
http://krlmlr.github.io/integrating-xpra-with-screen/ .

-Kirill

Footnotes: 
[1]  https://www.docker.io

[2]  http://docs.docker.io/en/latest/examples/running_ssh_service/

Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 494 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20140324/3feea779/attachment.bin>
o Roger correctly notes that R scripts and packages are just one issue.
  Compilers, libraries and the OS matter.  To me, the natural approach these
  days would be to think of something based on Docker or Vagrant or (if you
  must, VirtualBox).  The newer alternatives make snapshotting very cheap
  (eg by using Linux LXC).  That approach reproduces a full environemnt as
  best as we can while still ignoring the hardware layer (and some readers
  may recall the infamous Pentium bug of two decades ago).
At one of my previous jobs we did effectively this (albeit in a lower tech 
fashion). Every project had its own environment, complete with the exact 
snapshot of R & packages used, etc. All scripts/code was kept in that 
environment in a versioned fashion such that at any point one could go to 
any stage of development of that paper/project's analysis and reproduce it 
exactly.

It was hugely inefficient in terms of storage, but it solved the problem 
we're discussing here. As you note, with the tools available today it'd be 
trivial to distribute that environment for people to reproduce results.