[R-pkg-devel] suggestion: conda for third-party software

7 messages · Serguei Sokol, Ivan Krylov, Kevin Ushey

#
Best wishes for 2020!

I would like to suggest a new feature for R package management. Its aim
is to enable package developers and end users to rely on conda (
https://docs.conda.io/en/latest/ ) for managing third-party software
(TPS) on the major platforms: linux64, win64 and osx64. Currently, many
R packages bundle TPS, which bloats their size and often duplicates
files on a given system. And even when TPS is not bundled with an R
package but is simply installed on the system, finding the right path to
it is not obvious. Sometimes pkg-config helps, but it is not always
present.

So, the new feature would be to let R package developers write
something like 'conda:boost-cpp>=1.71' in the SystemRequirements field
of DESCRIPTION, where 'boost-cpp' is an example of a conda package and
'>=1.71' is an optional version requirement. This could allow
install.packages() to install TPS on a CRAN test machine or on an end
user's machine. (There is just one line to execute in a shell:
conda install <pkg-name>. It installs the package itself as well as
all of its dependencies.)
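
As a rough illustration of the idea above, here is a minimal Python sketch that picks 'conda:' entries out of a SystemRequirements string and builds the corresponding conda command line. The 'conda:' tag syntax is only the one proposed in this message, and the function names, the comma-separated layout and the regular expression are assumptions made for the example, not an existing R or conda API.

```python
import re

# Hypothetical syntax from the proposal: "conda:<name>[<op><version>]".
_SPEC = re.compile(
    r"^conda:(?P<name>[A-Za-z0-9_.\-]+)\s*"
    r"(?P<constraint>(?:>=|<=|==|!=|>|<|=)\s*[\w.*]+)?$"
)


def parse_system_requirements(field):
    """Return (name, constraint) pairs for entries tagged 'conda:'.

    Assumes entries are separated by commas; untagged entries are skipped.
    """
    specs = []
    for entry in field.split(","):
        match = _SPEC.match(entry.strip())
        if match:
            constraint = (match.group("constraint") or "").replace(" ", "")
            specs.append((match.group("name"), constraint))
    return specs


def conda_install_command(specs):
    """Build an argument list suitable for subprocess.run (no shell quoting needed)."""
    return ["conda", "install", "--yes"] + [name + c for name, c in specs]


specs = parse_system_requirements("GNU make, conda:boost-cpp>=1.71, conda:zlib")
print(conda_install_command(specs))
```

The command is returned as a list rather than a string so that '>=' is never interpreted by a shell as a redirection.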

To my mind, this feature would have the following advantages:
 - savings in disk space, as the same TPS does not have to be included
in the R package itself and can be shared with wrappers for other
languages, e.g. Python;
 - easy configuration of flags in Makevars, as paths to TPS will be
known in advance;
 - CRAN machines could test packages relying on a wide range of TPS
without anyone having to install them manually;
 - TPS installation can become transparent for the end user on the
major platforms.
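
To make the Makevars point concrete, here is a small Python sketch that derives compiler and linker flags from a conda environment prefix. It assumes the usual conda layout (headers under <prefix>/include and libraries under <prefix>/lib, with an extra Library/ level on Windows); the function name and the example prefix are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def makevars_flags(prefix, lib, windows=False):
    """Return Makevars variables pointing at a conda environment.

    Assumption: conda places headers and libraries under <prefix>/include
    and <prefix>/lib, or under <prefix>/Library/... on Windows.
    """
    if windows:
        root = PureWindowsPath(prefix) / "Library"
    else:
        root = PurePosixPath(prefix)
    include_dir = (root / "include").as_posix()
    lib_dir = (root / "lib").as_posix()
    return {
        "PKG_CPPFLAGS": f"-I{include_dir}",
        "PKG_LIBS": f"-L{lib_dir} -l{lib}",
    }


# Example with a hypothetical prefix; in practice it could come from the
# CONDA_PREFIX environment variable of an activated environment.
for name, value in makevars_flags("/opt/conda", "z").items():
    print(f"{name} = {value}")
```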

Note that even though R itself is available through conda (
https://anaconda.org/conda-forge/r-base ), this feature does not
require using conda's version of R. Here, conda is only meant to
facilitate access to TPS. However, a minimal requirement is obviously
to have conda itself installed.

Does this look reasonable? Appealing?
Best,
Serguei.
#
The newest version of reticulate does something very similar: R
packages can declare their Python package dependencies in the
Config/reticulate field of a DESCRIPTION file, and reticulate can read
and use those dependencies to provision a Python environment for the
user when requested (currently using Miniconda).

Similarly, rather than having this part of SystemRequirements, package
authors could declare these in a separate field called e.g.
Config/conda. Then, you could have an R package that knows how to read
and parse these configuration requests, and install those packages for
the user.
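
A minimal sketch of such a reader, written in Python for illustration: it parses the "Field: value" layout of a DESCRIPTION file, including indented continuation lines, and extracts a comma-separated list from a Config/conda field. The field name follows the suggestion above; its comma-separated format and the function names are assumptions, not something reticulate or R defines.

```python
def read_dcf(text):
    """Parse 'Field: value' records; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            fields[current] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


def conda_requirements(description_text, field="Config/conda"):
    """Return the conda package specs declared in the (hypothetical) field."""
    raw = read_dcf(description_text).get(field, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


DESCRIPTION = """\
Package: demo
Version: 0.1
Config/conda:
    boost-cpp>=1.71,
    zlib
"""

print(conda_requirements(DESCRIPTION))
```

A package that installs these requirements could then hand the resulting list to conda, leaving packages without the field untouched.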

That said, maintaining a Conda installation and its environments is
non-trivial, and things do not always work as expected when mixing
Conda applications with non-Conda applications. Most notably, Conda
installations bundle their own copies of libraries; e.g. the C++
standard library, Qt, OpenSSL, and so on. If an application tries to
mix and match both system-provided and Conda-provided libraries in the
same process, bad things often happen. This was still the
lowest-friction way forward for us with reticulate, but it's worth
being aware that Conda is not a total panacea.

Best,
Kevin
On Tue, Jan 7, 2020 at 6:50 AM Serguei Sokol <serguei.sokol at gmail.com> wrote:
#
Thanks for this hint.

On 07/01/2020 at 20:47, Kevin Ushey wrote:
If Miniconda is used, does that mean that any conda package, not only
Python ones, can be declared as a dependency?

And another question: do you know whether Miniconda is installed on the
CRAN test machines? (Without it, I cannot see how your packages with
conda dependencies could be tested during submission.)

Best,

Serguei.
#
On Tue, 7 Jan 2020 15:49:45 +0100
Serguei Sokol <serguei.sokol at gmail.com> wrote:

            
I agree that making a package depend on a third-party library means
finding oneself in a bit of a pickle. A really popular library like
cURL could be "just" depended upon (for the price of some problems when
building on Windows). A really small (e.g. 3 source files) and rarely
updated (just once last year) library like liborigin could "just" be
bundled (but the package maintainer would have to constantly watch out
for new versions of the library). Finding that the bundled version of a
network-facing library in an R package (e.g. libuv in httpuv) is several
minor versions out of date is always a bit scary, even if it turns out
that no major security flaws have been found in that version (just a few
low-probability resource leaks, one unlikely NULL pointer dereference
and some portability problems). The road to dependency hell is paved
with intentions of code reuse.

While I appreciate the effort behind Anaconda, I would hate to see it
being *required* to depend on third-party binaries compiled by a
fourth-party (am I counting my parties right?) when there's already a
copy installed and available via means the user trusts more (e.g. via
GNU/Linux distro package, or Homebrew on macOS, or just a copy sitting
in /usr/local installed manually from source). In this regard, a
separate field like "Config/conda" suggested by Kevin Ushey sounds like
a good idea: if one wants to use Anaconda, the field is there. If one
doesn't, one can just ignore it and provide the necessary dependencies
in a different way.
#
On 08/01/2020 at 08:50, Ivan Krylov wrote:
The same would apply to my proposal: if you want to, you use
conda:something; if not, you do as before. In any case, I am not
campaigning for a 'conda:' tag in SystemRequirements. Kevin's
Config/conda solution seems sufficient for this issue. I just was not
aware that it was already there.

Best,
Serguei.
#
On Tue, Jan 7, 2020 at 10:42 PM Sokol Serguei <serguei.sokol at gmail.com> wrote:
In theory yes, but reticulate only accepts Python package dependencies,
since its primary goal is interoperation with Python.

As for Miniconda on the CRAN machines, I don't think so. I can't speak
for CRAN, but their time is precious and it seems unlikely to me that
they would be willing to spend the time needed to maintain Conda
installations across their fleet of machines.

Packages using Miniconda in this way could still run their tests on
different types of infrastructure, though (e.g. Travis CI).
#
It would also be worth looking at the basilisk package:

https://github.com/LTLA/basilisk

Its approach is instead to embed a Conda installation as part of the R
package itself. This has the benefit that maintaining the Conda
installation becomes the package author's responsibility (not CRAN's
or the users'), but the drawback that installing or upgrading that
Conda environment may become more challenging.

One other large benefit of this approach is that it forces R package
authors who want to use Python through reticulate to standardize on
the same environment. Note that reticulate can only bind to a single
Python session per R session, so attempting to have R packages which
use incompatible Python dependencies could quickly become an issue.
(Python packages tend to rely on virtual environments, and so they
tend to declare narrower dependency version requirements.)
Hence, having a "standardized" Python environment that can be used by
R packages through reticulate (or other Python-wrapping packages)
should be very useful.

If you're curious, there's a more detailed discussion here:

https://github.com/LTLA/basilisk/issues/2

Best,
Kevin
On Wed, Jan 8, 2020 at 8:34 AM Kevin Ushey <kevinushey at gmail.com> wrote: