On 27/09/2022, at 11:02 AM, Gabriel Becker <gabembecker at gmail.com> wrote:
For the record, the only thing switchr (my package) is doing
Internet-wise should be hitting the bioconductor config file (
http://bioconductor.org/config.yaml) so that it knows the things it needs
to know about Bioc repos/versions/etc (at load time, actually, not install
time, but since install does a test load, those are essentially the same).
I have fallback behavior for when the file can't be read, so there
shouldn't be any actual build breakages/install breakages I don't think,
but the check does happen.
$ sandbox-exec -n no-network R CMD INSTALL switchr_0.14.5.tar.gz
[...]
** testing if installed package can be loaded from final location
Error in readLines(con) :
cannot open the connection to 'http://bioconductor.org/config.yaml'
Calls: <Anonymous> ... getBiocDevelVr -> getBiocYaml -> inet_handlers ->
readLines
Execution halted
ERROR: loading failed
So, yes, it does break. You should recover from the error and use a
fall-back file that you ship.
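
A minimal sketch of that recovery in R, assuming a snapshot of config.yaml is shipped under inst/extdata/ in the package. The helper name, its arguments, and the shipped path are hypothetical and are not switchr's actual code:

```r
# Hypothetical helper: try the online config first, and fall back to a
# snapshot shipped with the package if the connection cannot be opened.
read_config_with_fallback <- function(
    url = "http://bioconductor.org/config.yaml",
    fallback = system.file("extdata", "config.yaml", package = "switchr")) {
  tryCatch(
    # readLines() signals an error (plus a warning) when the URL or file
    # cannot be opened, e.g. inside a no-network sandbox.
    suppressWarnings(readLines(url, warn = FALSE)),
    error = function(e) {
      # No network or server unavailable: use the bundled snapshot.
      readLines(fallback, warn = FALSE)
    }
  )
}
```

Because the error is caught inside the helper, the test load performed by R CMD INSTALL would not halt when the connection is refused.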
Cheers,
Simon
Advice on what to do for the above use case that is better practice is
welcome.
~G
On Mon, Sep 26, 2022 at 2:40 PM Simon Urbanek <
simon.urbanek at r-project.org> wrote:
On 27/09/2022, at 10:21 AM, Iñaki Ucar <iucar at fedoraproject.org> wrote:
On Mon, 26 Sept 2022 at 23:07, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Iñaki,
I'm not sure I understand - system dependencies are an entirely
different topic and I would argue a far more important one (very happy to
start a discussion about that), but that has nothing to do with declaring
downloads. I assumed your question was about large files which packages
avoid shipping and download instead, so declaring them would be
useful.
Exactly. Maybe there's a misunderstanding, because I didn't talk about
system dependencies (alas there are packages that try to download things
that are declared as system dependencies, as Gabe noted). :)
Ok, understood. I would like to tackle those as well, but let's start
that conversation in a few weeks when I have a lot more time.
And for that, the obvious answer is they shouldn't do that - if a
package needs a file to run, it should include it. So an easy solution is
to disallow it.
Then we completely agree. My proposal about declaring additional
sources was because, given that so many packages do this, I thought that I
would find a strong opposition to this. But if R Core / CRAN is ok with
just limiting net access at install time, then that's perfect to me. :)
Yes we do agree :). I started looking at your list, and so far those
seem simply bugs or design deficiencies in the packages (and outright
policy violations). I think the only reason they exist is that it doesn't
get detected in CRAN incoming, it's certainly not intentional.
But so far all examples were just (ab)use of downloads for binary
dependencies which is an entirely different issue that needs a different
solution (in a naive way declaring such dependencies, but we know it's not
that simple - and download URLs don't help there).
On 27/09/2022, at 8:25 AM, Iñaki Ucar <iucar at fedoraproject.org> wrote:
On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
Iñaki,
I fully agree, this is a very common issue since the vast majority of
server deployments I have encountered don't allow internet access. In
practice this means that such packages are effectively banned.
I would argue that not even (1) or (2) are really an issue, because
in fact the CRAN policy doesn't impose any absolute limits on size, it only
states that the package should be "of minimum necessary size" which means
it shouldn't waste space. If there is no way to reduce the size without
impacting functionality, it's perfectly fine.
"Packages should be of the minimum necessary size" is subject to
interpretation. And in practice, there is an issue with e.g. packages
that "bundle" big third-party libraries. There are also packages that
require downloading precompiled code, JARs... at installation time.
That said, there are exceptions such as very large datasets (e.g.,
as distributed by Bioconductor) which are orders of magnitude larger than
what is sustainable. I agree that it would be nice to have a mechanism for
specifying such sources. So yes, I like the idea, but I'd like to see more
real use cases to justify the effort.
"More real use cases" like in "more use cases" or like in "the
previous ones are not real ones"? :)
The issue with any online downloads, though, is that there is no
guarantee of availability - which is a real issue for reproducibility. So one
could argue that if such external sources are required then they should be
on a well-defined, independent, permanent storage such as Zenodo. This
could be a matter of policy as opposed to the technical side above which
would be adding such support to R CMD INSTALL.
Not necessarily. If the package declares the additional sources in
DESCRIPTION (probably with hashes), that's a big improvement over the
current state of things, in which basically we don't know what the
package tries to download, then it may fail, and finally there's no
guarantee that it's what the author intended in the first place.
But on top of this, R could add a CMD to download those, and then
lookaside storage could be used on CRAN. This is e.g. how RPM
packaging works: the spec declares all the sources, they are
downloaded once, hashed and stored in a lookaside cache. Then package
building doesn't need general Internet connectivity, just access to
the cache.
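
A rough sketch of the lookaside idea in R, assuming each declared source comes with a checksum and the builder has a local cache directory. The function name and arguments are hypothetical, and MD5 is used only because tools::md5sum() ships with base R; a stronger hash would be preferable in practice:

```r
# Hypothetical: resolve a declared source from a local lookaside cache,
# refusing to use it unless it matches the declared checksum.
fetch_from_cache <- function(name, md5, cache_dir) {
  path <- file.path(cache_dir, name)
  if (!file.exists(path)) {
    stop("source '", name, "' is not in the cache: ", cache_dir)
  }
  # tools::md5sum() returns a named character vector; drop the name
  # so the comparison is on the hash string alone.
  actual <- unname(tools::md5sum(path))
  if (!identical(actual, md5)) {
    stop("checksum mismatch for '", name, "'")
  }
  path
}
```

With something like this, a build only needs read access to the cache, and a tampered or truncated file is rejected before the package sees it.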
Iñaki
On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iucar at fedoraproject.org> wrote:
Hi all,
I'd like to open this debate here, because IMO this is a big issue.
Many packages do this for various reasons, some more legitimate than
others, but I think that this shouldn't be allowed, because it
basically means that installation fails in a machine without Internet
access (which happens e.g. in Linux distro builders for security
reasons).
Now, what if connection is suppressed during package load? There
are basically three use cases out there:
(1) The package requires additional files for the installation (e.g.,
the source code of an external library) that cannot be bundled into
the package due to CRAN restrictions (size).
(2) The package requires additional files for using it (e.g.,
datasets, a JAR...) that cannot be bundled into the package due to
CRAN restrictions (size).
(3) Other spurious reasons (e.g. the maintainer decided that package
load was a good place to check an online service availability).
Again IMO, (3) shouldn't be allowed in any case; (2) should be a
separate function that the user actively calls to download the files,
and those files should be placed into the user dir; and (1) is the
only legitimate use, but then another mechanism should be provided to
avoid connections during package load.
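
For case (2), a minimal sketch of such a user-invoked function in R. The function name and its arguments are hypothetical; tools::R_user_dir() (available since R 4.0.0) is assumed as the "user dir":

```r
# Hypothetical exported function: nothing is fetched at install or load
# time; the user calls this explicitly when the extra files are wanted.
download_extra_data <- function(url, package,
                                dir = tools::R_user_dir(package, which = "data")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dir, basename(url))
  # Skip the download if a previous call already fetched the file.
  if (!file.exists(dest)) {
    utils::download.file(url, dest, mode = "wb")
  }
  dest
}
```

The returned path lets the rest of the package locate the file later without touching the network again.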
My proposal to support (1) would be to add a new field in the
DESCRIPTION, "Additional_sources", which would be a comma-separated
list of additional resources to download during R CMD INSTALL. Those
sources would be downloaded by R CMD INSTALL if not provided via an
option (to support offline installations), and would be placed in a
predefined place for the package to find and configure them (via an
environment variable or in a predefined subdirectory).
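
To illustrate, here is a sketch in R of how such a field could be read. The "Additional_sources" field does not exist in R today, and the package name, URLs, and helper function below are invented for the example:

```r
# A made-up DESCRIPTION that uses the proposed (not yet existing) field.
desc <- tempfile()
writeLines(c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Additional_sources: https://example.org/libfoo-1.0.tar.gz,",
  "    https://example.org/data.rds"
), desc)

# Hypothetical helper: read.dcf() joins continuation lines into a single
# value, which is then split on commas and stripped of whitespace.
parse_additional_sources <- function(desc_file) {
  value <- read.dcf(desc_file, fields = "Additional_sources")[1, 1]
  if (is.na(value)) return(character(0))
  trimws(strsplit(value, ",", fixed = TRUE)[[1]])
}

parse_additional_sources(desc)
```

A tool such as R CMD INSTALL, or a distro builder, could then fetch or look up each listed resource before the package's own configure step runs.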
This proposal has several advantages. Apart from the obvious one
(Internet access during package load can be limited without losing
current functionalities), it gives more visibility to the resources
that packages are using during the installation phase, and thus makes
those installations more reproducible and more secure.
Best,
--
Iñaki Úcar