Duplicated mirrors on available packages

4 messages · Colin Gillespie, Maxim Nazarov, Kurt Hornik

#
Hi

When there are duplicated repos, available.packages() takes
significantly longer to run.

For example

mirror = "https://cloud.r-project.org/"
system.time(available.packages(repos = mirror))
#   user  system elapsed
# 1.054   0.031   1.245
system.time(available.packages(repos = c(mirror, mirror)))
#   user  system elapsed
# 22.389   0.037  22.429

Best wishes,

Colin
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0 tools_4.2.0


Dr Colin Gillespie
https://twitter.com/csgillespie
2 days later
#
If you profile the second run, you will see that the majority of the time is spent in the `tools:::.remove_stale_dups` function, which loops over all duplicated packages, that is, over every package in this case.
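A minimal sketch of how such a profile could be collected with base R's sampling profiler. It assumes network access to the mirror and an R version that still shows the slowdown (R 4.2.0 in the original report); `prof_file` is just an illustrative name.

```r
# Profile available.packages() with a duplicated repository entry.
# Assumes network access; timings and the hot spot depend on the R version.
mirror <- "https://cloud.r-project.org/"
prof_file <- tempfile(fileext = ".out")  # illustrative temporary file

Rprof(prof_file)                          # start the sampling profiler
ap <- available.packages(repos = c(mirror, mirror))
Rprof(NULL)                               # stop profiling

# Functions ranked by total time, including time spent in their callees
head(summaryRprof(prof_file)$by.total, 10)
```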
One improvement I could think of is to replace the first line of that function
    pkgs <- ap[, "Package"]
with
    pkgs <- ap[!duplicated(ap[, c("Package", "Version")]), "Package"]
which would help in your example, but the function might still run longer if there are many packages with different versions present, so there may be even better optimizations.
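To illustrate the effect of that one-line change, here is a small sketch on a made-up matrix (the package names and versions are hypothetical, and this is not the actual `tools` code). It only compares how many duplicated package names remain to be looped over before and after dropping identical Package/Version rows.

```r
# Toy stand-in for the matrix returned by available.packages():
# pkgA appears twice with the same version, pkgB with two different versions.
ap <- matrix(
  c("pkgA", "1.0",
    "pkgA", "1.0",
    "pkgB", "2.0",
    "pkgB", "2.1"),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("Package", "Version"))
)

# Current first line: every row contributes a package name
pkgs_all <- ap[, "Package"]

# Proposed first line: rows with identical Package and Version are dropped first
pkgs_dedup <- ap[!duplicated(ap[, c("Package", "Version")]), "Package"]

# Number of duplicated names a loop over duplicates would have to visit
n_dups_all   <- length(pkgs_all[duplicated(pkgs_all)])
n_dups_dedup <- length(pkgs_dedup[duplicated(pkgs_dedup)])
```

Only pkgB, which really has two different versions, is still flagged as duplicated after the change.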

Stating the obvious here, but since we don't know your 'real' use case: wrapping the `repos` argument of `available.packages` in a `unique` call would achieve a similar improvement without any modification to `tools`.
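A sketch of that workaround on the caller's side. The repository names `CRAN` and `EXTRA` are hypothetical; the actual `available.packages()` call is commented out because it needs network access.

```r
# Two entries pointing at the same mirror (names are made up for illustration)
repos <- c(CRAN  = "https://cloud.r-project.org/",
           EXTRA = "https://cloud.r-project.org/")

# unique() removes the repeated URL, but also drops the names
repos_unique <- unique(repos)

# If the names matter (e.g. for options(repos = ...)), keep the first occurrence
repos_named <- repos[!duplicated(repos)]

# ap <- available.packages(repos = repos_unique)  # requires network access
```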

Kind regards,
Maxim Nazarov

----- Original Message -----
From: "Colin Gillespie" <csgillespie at gmail.com>
To: "r-devel" <r-devel at r-project.org>
Sent: Friday, September 9, 2022 7:33:09 PM
Subject: [Rd] Duplicated mirrors on available packages

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
#
The use case came from the rig application
(https://github.com/r-lib/rig). Rig (I think) inserts the RStudio
package manager into the list of repos. This can cause duplication in
repos, hence the current issue.
Now that I know the reason, I can work around it.



On Mon, 12 Sept 2022 at 09:57, Maxim Nazarov
<maxim.nazarov at openanalytics.eu> wrote:
1 day later
#
Thanks for reporting this issue.  I just changed available.packages() à la

-    for(repos in contriburl) {
+    for(repos in unique(contriburl)) {

which avoids the full duplication.
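A small sketch of why this works, outside of the real function body: duplicated entries in `repos` map to identical contrib URLs, so looping over `unique(contriburl)` visits each one once. The `message()` call is only a placeholder for the per-repository work, and `type = "source"` is chosen here to keep the result platform independent.

```r
mirror <- "https://cloud.r-project.org/"

# Duplicated repos give duplicated contrib URLs
contriburl <- utils::contrib.url(c(mirror, mirror), type = "source")

# Looping over the unique URLs processes each repository only once
for (repos in unique(contriburl)) {
  message("would read the package index from: ", repos)  # placeholder
}
```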

Best
-k