I'd like to raise this again now that 4.4 is out. Below is a more complete patch which includes a function to properly cleanup libcurl when R quits. Implementing this is a little tricky because libcurl is a separate "module" in R, perhaps there is a better way, but this works: view: https://github.com/r-devel/r-svn/pull/166/files patch: https://github.com/r-devel/r-svn/pull/166.diff The old patch is still there as well, which is meant a minimal proof of concept to test the performance gains for reusing the connection: view: https://github.com/r-devel/r-svn/pull/155/files patch: https://github.com/r-devel/r-svn/pull/155.diff Performance gains are greatest on high-bandwidth servers when downloading many files from the same server (e.g. packages from a cran mirror). In such cases, currently over 90% of the total time is spent on initiating and tearing town a separate SSL connection for each file download. Thoughts?
On Sat, Mar 2, 2024 at 3:07?PM Jeroen Ooms <jeroenooms at gmail.com> wrote:
Currently download.file() creates and terminates a new TLS connection for each download. This creates a lot of overhead which is expensive for both client and server (in particular the TLS handshake). Modern internet clients (including browsers) re-use connections for many http requests. We can do this in R by creating a persistent libcurl "multi-handle". The R libcurl implementation already uses a multi-handle, however it destroys it after each download, which defeats the purpose. The purpose of the multi-handle is to keep it alive and let libcurl maintain a persistent connection pool. This is particularly relevant for install.packages() which needs to download many files from one and the same server. Here is a bare minimal proof of concept patch that re-uses one and the same multi-handle for all requests in R: https://github.com/r-devel/r-svn/pull/155/files Some quick benchmarking shows that this can lead to big speedups for download.packages() on high-bandwidth servers (such as CI). This quick test to download 100 packages from CRAN showed more than 10x speedup for me: https://github.com/r-devel/r-svn/pull/155 Moreover, I think this may make install.packages() more robust. In CI build logs that download many packages, I often see one or two downloads randomly failing with a TLS-connect error. I am hopeful this problem will disappear when using a single connection to the CRAN server to download all the packages.