
Proposal to limit Internet access during package load

22 messages · Simon Urbanek, Gabriel Becker, Bob Rudis +4 more

#
Hi all,

I'd like to open this debate here, because IMO this is a big issue.
Many packages do this for various reasons, some more legitimate than
others, but I think that this shouldn't be allowed, because it
basically means that installation fails on a machine without Internet
access (which happens, e.g., in Linux distro builders for security
reasons).

Now, what if the connection is suppressed during package load? There are
basically three use cases out there:

(1) The package requires additional files for the installation (e.g.
the source code of an external library) that cannot be bundled into
the package due to CRAN restrictions (size).
(2) The package requires additional files for using it (e.g.,
datasets, a JAR...) that cannot be bundled into the package due to
CRAN restrictions (size).
(3) Other spurious reasons (e.g. the maintainer decided that package
load was a good place to check an online service availability, etc.).

Again, IMO, (3) shouldn't be allowed in any case; (2) should be a
separate function that the user actively calls to download the files,
and those files should be placed into the user dir; and (3) is the
only legitimate use, but then other mechanisms should be provided to
avoid connections during package load.

My proposal to support (3) would be to add a new field in the
DESCRIPTION, "Additional_sources", which would be a comma-separated
list of additional resources to download during R CMD INSTALL. Those
sources would be downloaded by R CMD INSTALL if not provided via an
option (to support offline installations), and would be placed in a
predefined place for the package to find and configure them (via an
environment variable or in a predefined subdirectory).
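
For illustration, such a field might look like this in a DESCRIPTION file. The field name, URLs, and hash syntax are all hypothetical; this sketches the proposal, not an existing R feature:

```
Package: mypkg
Version: 1.0.0
Additional_sources:
    https://example.org/icudt61l.zip#sha256=4a1d0f...,
    https://example.org/training-data.tar.gz#sha256=9b2c7e...
```

R CMD INSTALL would fetch these into a predefined location, or accept pre-downloaded copies via an option for offline builds, verifying the declared hashes either way.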

This proposal has several advantages. Apart from the obvious one
(Internet access during package load can be limited without losing
current functionality), it gives more visibility to the resources
that packages are using during the installation phase, and thus makes
those installations more reproducible and more secure.

Best,
#
On Fri, 23 Sept 2022 at 17:22, Iñaki Ucar <iucar at fedoraproject.org> wrote:
I meant "(1) is the only legitimate use" above.
#
Iñaki,

I fully agree, this is a very common issue, since the vast majority of server deployments I have encountered don't allow internet access. In practice this means that such packages are effectively banned.

I would argue that not even (1) or (2) are really an issue, because in fact the CRAN policy doesn't impose any absolute limits on size, it only states that the package should be "of minimum necessary size" which means it shouldn't waste space. If there is no way to reduce the size without impacting functionality, it's perfectly fine.

That said, there are exceptions such as very large datasets (e.g., as distributed by Bioconductor) which are orders of magnitude larger than what is sustainable. I agree that it would be nice to have a mechanism for specifying such sources. So yes, I like the idea, but I'd like to see more real use cases to justify the effort.

The issue with any online downloads, though, is that there is no guarantee of availability - which is a real issue for reproducibility. So one could argue that if such external sources are required then they should be on well-defined, independent, permanent storage such as Zenodo. This could be a matter of policy, as opposed to the technical side above, which would be adding such support to R CMD INSTALL.

Cheers,
Simon
2 days later
#
On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
"Packages should be of the minimum necessary size" is subject to
interpretation. And in practice, there is an issue with e.g. packages
that "bundle" big third-party libraries. There are also packages that
require downloading precompiled code, JARs... at installation time.
"More real use cases" like in "more use cases" or like in "the
previous ones are not real ones"? :)
Not necessarily. If the package declares the additional sources in the
DESCRIPTION (probably with hashes), that's a big improvement over the
current state of things, in which basically we don't know what the
package tries to download, then it may fail, and finally there's no
guarantee that it's what the author intended in the first place.

But on top of this, R could add a CMD to download those, and then some
lookaside storage could be used on CRAN. This is e.g. how RPM
packaging works: the spec declares all the sources, they are
downloaded once, hashed and stored in a lookaside cache. Then package
building doesn't need general Internet connectivity, just access to
the cache.
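
As a point of comparison, a minimal RPM spec fragment might look like this (the package name and the extra-data URL are made up for illustration). Tools such as `spectool -g` from rpmdevtools can then download all declared Source files once, after which rpmbuild only needs the local copies:

```
# Illustrative spec fragment; the extra-data URL is hypothetical.
Name:     R-mypkg
Version:  1.0.0
Source0:  https://cran.r-project.org/src/contrib/mypkg_1.0.0.tar.gz
# Additional install-time resource, fetched once into the lookaside cache:
Source1:  https://example.org/mypkg-extra-data.tar.gz
```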

Iñaki
#
JARs are part of the package, so that's a valid use - no question there; that's how Java packages do this already.

Downloading pre-compiled binaries is something that shouldn't be done and is a whole can of worms (since those are not sources and it *is* specific to the platform, OS, etc.) - an entirely separate topic worth its own discussion. So I still don't see any use cases for actual sources. I do see a need for better specification of external dependencies which are not part of the package, such that those can be satisfied automatically - but that's not the problem you asked about.
Sure, I fully agree that it would be a good first step, but I'm still waiting for examples ;).

Cheers,
Simon
#
Hi Simon,

The most popular and widely used example of this "in the wild" that I'm
aware of is the stringi package (which is a dependency of the widely used
stringr package), whose configure file downloads the ICU Data Library (icudt).

See https://github.com/gagolews/stringi/blob/master/configure#L5412

Note it does have some sort of workaround in place for non-internet-capable
build machines, but it is external (the build in question fails unless the
workaround has already been explicitly performed).

Best,
~G



On Mon, Sep 26, 2022 at 12:50 PM Simon Urbanek <simon.urbanek at r-project.org>
wrote:
#
On Mon, 26 Sept 2022 at 21:50, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Oh, you want me to actually name specific packages? I thought that
this was a well-established fact from your initial statement "I fully
agree, this a very common issue [...]", so I preferred to avoid
pointing fingers.

But of course you can start by taking a look at [1], where all
packages marked as "internet" or "cargo" are downloading stuff at
install time. There are some others that are too important to get rid
of, so I just build them with an Internet connection from time to
time. Or have them patched to avoid such downloads.

And others have been fixed after I opened an issue because a package
blew up when I tried to build an RPM with it. But this is like playing
cat and mouse as long as this is not enforced somehow.

[1] https://github.com/Enchufa2/cran2copr/blob/master/excl-no-sysreqs.txt
#
Gabe,

that's a great example of how **not** to do it and why it is such a bad idea. icu4c is a system library, so it is generally available and already includes the data, so embedding data from an outdated version is generally bad. I'm not sure why it should be needed in the first place, since ICU actually tries to avoid the need for external files, so I'd say this would ideally be fixed in stringi.

That said, if you want to cache static data from the system library, that is an option, but it should be done at build time from the system (no internet needed) - it is a common practice; have a look at sf (and other packages that copy projection data from PROJ). So, yes, that's a good argument for disallowing downloads, to detect such issues in packages.

Cheers,
Simon
#
Iñaki,

I'm not sure I understand - system dependencies are an entirely different topic, and I would argue a far more important one (very happy to start a discussion about that), but that has nothing to do with declaring downloads. I assumed your question was about large files which packages avoid shipping and download instead, so declaring them would be useful. And for that, the obvious answer is that they shouldn't do that - if a package needs a file to run, it should include it. So an easy solution is to disallow it.

But so far all examples were just (ab)use of downloads for binary dependencies, which is an entirely different issue that needs a different solution (naively, declaring such dependencies, but we know it's not that simple - and download URLs don't help there).

Cheers,
Simon
#
On Mon, 26 Sept 2022 at 23:07, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Exactly. Maybe there's a misunderstanding, because I didn't talk about
system dependencies (though there are packages that try to download
things that are declared as system dependencies, as Gabe noted). :)
Then we completely agree. My proposal about declaring additional
sources was made because, given that so many packages do this, I thought
I would find strong opposition to an outright ban. But if R Core / CRAN is
ok with just limiting net access at install time, then that's perfect
for me. :)

Iñaki
#
Ok, understood. I would like to tackle those as well, but let's start that conversation in a few weeks when I have a lot more time.
Yes, we do agree :). I started looking at your list, and so far those seem to be simply bugs or design deficiencies in the packages (and outright policy violations). I think the only reason they exist is that this doesn't get detected in CRAN incoming; it's certainly not intentional.

Cheers,
Simon
#
For the record, the only thing switchr (my package) is doing internet-wise
should be hitting the Bioconductor config file
(http://bioconductor.org/config.yaml) so that it knows the things it needs to
know about Bioc repos/versions/etc. (at load time, actually, not install
time, but since install does a test load, those are essentially the same).

I have fallback behavior for when the file can't be read, so there
shouldn't be any actual build/install breakages, I think, but the check
does happen.

Advice on what to do for the above use case that is better practice is
welcome.

~G

On Mon, Sep 26, 2022 at 2:40 PM Simon Urbanek <simon.urbanek at r-project.org>
wrote:
#
$ sandbox-exec -n no-network R CMD INSTALL switchr_0.14.5.tar.gz 
[...]
** testing if installed package can be loaded from final location
Error in readLines(con) : 
  cannot open the connection to 'http://bioconductor.org/config.yaml'
Calls: <Anonymous> ... getBiocDevelVr -> getBiocYaml -> inet_handlers -> readLines
Execution halted
ERROR: loading failed

So, yes, it does break. You should recover from the error and use a fall-back file that you ship.
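
A minimal sketch of that pattern in R (the function name and file paths are illustrative, not switchr's actual API): try the remote file first, and fall back to a copy shipped in the package's inst/ directory.

```r
# Read a remote config file, falling back to a bundled copy when offline.
read_config <- function(url, fallback) {
  tryCatch(
    suppressWarnings(readLines(url, warn = FALSE)),
    error = function(e) readLines(fallback, warn = FALSE)
  )
}

# Illustrative usage, e.g. from .onLoad():
# cfg <- read_config("http://bioconductor.org/config.yaml",
#                    system.file("extdata", "config.yaml", package = "switchr"))
```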

Cheers,
Simon
#
BTW: it is a good question whether packages that require internet access in order to function at all should be flagged as such, so they can be removed from server installations. Say a package provides an API for retrieving stock quotes online and that's all it does - then perhaps it does make sense to exclude it. It would be pointless to appease the load check just for it to be unable to perform the function it was designed for...

Cheers,
Simon
#
I would personally like something like an Android/iOS permissions
required/requested manifest document describing what the pkg needs,
with R doing what it can to enforce said permissions. R would be
breaking some ground in this space, but it does that regularly in many
respects. Yes, I know I just 10x++'d the scope.

I'd support just this flag, tho. Anything to increase transparency and safety.
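
Purely as a hypothetical sketch (no such field exists in R today; the package name and permission labels are made up), a DESCRIPTION-level permissions manifest could look something like:

```
Package: quotefetchr
Permissions:
    network/install (optional; downloads a data file if absent),
    network/runtime (required; the package queries an online API)
```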

On Mon, Sep 26, 2022 at 6:22 PM Simon Urbanek
<simon.urbanek at r-project.org> wrote:
#
Regarding 'system' libraries: Packages like stringi and nloptr download the
source of, respectively, libicu or libnlopt and build a library _if_ the
library is not found locally.  If we outlaw this, more users may hit a brick
wall because they cannot install system libraries (for lack of permissions),
or don't know how to, or ...  These facilities were not added to run afoul of
best practices -- they were added to help actual users. Something to keep in
mind. 

Dirk
#
Ah, that's embarrassing. That's a bug in how/where I handle lack of
connectivity, rather than me not doing it. I've just pushed a fix to the
GitHub repo that now cleanly passes check with no internet connectivity
(much more stringent).

Using a canned file is a bit odd, because in the case where there's no
connectivity, the package won't work anyway (the canned file would just set
the repositories to URLs that R still won't be able to reach).

Anyway,
Thanks
~G

On Mon, Sep 26, 2022 at 3:11 PM Simon Urbanek <simon.urbanek at r-project.org>
wrote:
#
On Tue, 27 Sept 2022 at 4:22, Dirk Eddelbuettel <edd at debian.org> wrote:
Yes, but then IMO Internet access should be explicitly enabled by the user
with a flag. By default, it should be disabled and packages on CRAN should
install as is.

Iñaki
#
Dear all, 

my apologies for a dull question. I think I do understand that unnoticed Internet access requires scrutiny and a more explicit approach.

But I am not sure how this would impact on the practice on many Windows machines to download static libraries from one of the rwinlib repositories? See https://github.com/rwinlib, an approach taken by quite a few packages (src/Makevars.win triggers tools/winlibs.R for downloading a static library).

I am asking because a package I maintain (RcppCWB) uses this approach, and I am not sure whether and how the discussion has addressed this scenario. It may not be covered by Iñaki's initial three scenarios?

Kind regards, Andreas





#
On Tue, 27 Sept 2022 at 18:42, Blätte, Andreas <andreas.blaette at uni-due.de>
wrote:
AFAIK, packages should compile on CRAN with the set of libraries that CRAN
has for Windows, and thus offline. The majority of Windows users then just
download precompiled binaries.

The rwinlib stuff is a nice-to-have feature for power users compiling their
own packages. But then again, those power users could enable Internet access
with the hypothetical flag I proposed.

I?aki
#
On 9/27/22 18:42, Blätte, Andreas wrote:
Dear Andreas,

please let me clarify for others that your package only downloads 
pre-compiled static libraries for R 4.1 and earlier on Windows, which is 
already now the "old release" and will not be checked against by CRAN 
once R 4.3 is released. For R 4.2 ("release") and R-devel, your package 
includes the source code of the library and links it, which is in 
compliance with the CRAN policy. So if any new restriction on 
downloading along the lines Iñaki raised were set, it probably wouldn't 
affect you.

Downloading pre-compiled static libraries is only possible as a very last 
resort and only with the agreement of the CRAN team - see the CRAN policy 
for details (https://cran.r-project.org/web/packages/policies.html; look 
for "external"). In other words, unless those special conditions are 
met, it is already banned now.

More explanation for why it is a bad thing can be found in 
https://cran.r-project.org/bin/windows/base/howto-R-4.2.html:

"For transparency, source packages should contain source (not executable 
code). Using pre-compiled libraries may lead to that after few years the 
information on how they were built gets lost or significantly outdated 
and no longer working. Using older binary code may provide insufficient 
performance (newer compilers tend to optimize better). Also, the CRAN 
(and Bioconductor) repositories are used as a unique test suite not only 
for R itself but also the toolchain, and by re-using pre-compiled 
libraries, some parts will not be tested. Compiler bugs are found and 
when fixed, the code needs to be re-compiled. Finally, object files (and 
hence static libraries, particularly when using C++) on Windows tend to 
become incompatible when even the same toolchain is upgraded. Going from 
MSVCRT to UCRT is an extreme case when all such code becomes 
incompatible, and adding support to 64-bit ARM would be another extreme 
case, but smaller updates of different parts of the toolchain or even 
some libraries in it lead to incompatibilities. The issues mentioned 
here are based on experience with the transition to UCRT and Rtools42; 
all of these things have happened and dealing with the downloads and 
re-use of static libraries was one of the biggest challenges."

With respect to any possible restriction on downloading, this is in 
principle no different from downloading external source code (both are 
needed when the native code of packages is being built, so during "R CMD 
INSTALL"), and that has already been mentioned in this thread. So it 
doesn't have to be discussed separately, I think.

Best
Tomas
#
Dear Tomas, thank you so much for the explanation. Very helpful for me, and relevant for the wider context of packages using rwinlib! Andreas
