Many distros and browsers these days use zstd as their preferred
compression method. For example, if you unpack a .deb or .rpm file on
Debian or Fedora, there is a zstd archive inside. It is claimed that zstd
offers better compression than gzip while (unlike lzma) keeping
comparable decompression speed. It may be interesting to get an
estimate of how much R packages would benefit from zstd.

Testing this for source packages and macOS binary packages is easy,
as we can gunzip and recompress the tar.gz files without having to extract
the tarball itself:
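OUTPUT="sizes.txt"
echo "FILE GZIP ZSTD" > "$OUTPUT"
for x in *gz; do
  FILE=$(basename "$x")
  # size in bytes of the existing gzip-compressed file
  GZIP=$(wc -c "$x" | awk '{print $1}')
  # size in bytes after recompressing the raw tar stream with zstd
  ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c)
  echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
done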
Attached are results of running this script on the 500 most downloaded
CRAN packages. It shows about 16% size reduction for sources, and 19%
for binaries.
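The overall reduction can be recomputed from the script's output. Below is a minimal sketch, assuming sizes.txt has the three-column layout written by the loop above (a header line, then file name, gzip bytes, zstd bytes). The function name summarize_sizes is made up for illustration. It reports the reduction in total bytes; whether the percentages quoted above are totals or per-package averages is not stated.

```shell
# Hypothetical helper: summarize a "FILE GZIP ZSTD" table as written above.
# Skips the header line, sums both size columns, and prints the
# reduction of the zstd total relative to the gzip total.
summarize_sizes() {
  awk 'NR > 1 { gz += $2; zs += $3 }
       END {
         if (gz > 0)
           printf "gzip=%d zstd=%d reduction=%.1f%%\n", gz, zs, 100 * (1 - zs / gz)
       }' "$1"
}
```

Usage would be `summarize_sizes sizes.txt`, run once for the source packages and once for the binaries.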
Zstd is BSD licensed C code that can easily be embedded in any project.
Attachments:
sources.txt: <https://stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment.txt>
binaries.txt: <https://stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment-0001.txt>
Package compression benchmarks for zstd vs gzip
7 messages · Jeroen Ooms, Avraham Adler, Henrik Bengtsson +3 more
1 day later
zstd is accessible within R using the archive package [1]. I use it all the time when saving large objects, using code I adapted from [2]. Is your suggestion to import the libraries/source code into base?

[1] https://CRAN.R-project.org/package=archive
[2] https://coolbutuseless.github.io/2018/10/02/using-lz4-and-zstandard-to-compress-files-with-saverds/
Can't speak for Jeroen, but it sounds like it's worth adding support for tar.zstd package files, just like how tar.gz, tar.xz, and tar.bzip2 are currently supported. I'd also argue for supporting zstd compression throughout R, including adding zstdfile(), support for saveRDS(..., compress = "zstd"), and so on. Then it could be discussed later what the default(s) should be.

It's probably also worth looking at package compression with 'xz' compression. In [1], Mike FC has a graph where 'bzip2' and 'xz' seem to give the best compression ratios, at least for RDS files.

FWIW, Mike FC submitted the 'zstdlite' package [1] to CRAN about a year ago, but it was removed, resubmitted, then removed again. I believe this was Mike FC's first ever CRAN submission, but I think they eventually gave up. From https://cran.r-project.org/src/contrib/PACKAGES.in:

Package: zstdlite
X-CRAN-Comment: Removed on 2024-03-18 for repeated policy violation.
  .
  Does not look for suitable system 'libzstd'. Spams personal email addresses of team members.
X-CRAN-History: Removed on 2024-03-13 for policy violation and misrepresentation of copyright holder(s).
  .
  Does not even attempt to use system 'libzstd'.
  Back on CRAN on 2024-03-17.

[1] https://github.com/coolbutuseless/zstdlite

/Henrik
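The xz comparison suggested above could reuse the same pipe-through approach as the original benchmark script. Here is a rough sketch, assuming the zstd, xz and bzip2 command-line tools are installed. recompressed_size is a hypothetical helper name, and the levels shown in the usage comments are only example settings.

```shell
# Hypothetical helper: decompress a .gz file and print the size in bytes
# of the raw stream after piping it through an arbitrary compressor command.
# The first argument is the .gz file; the rest is the command to run.
recompressed_size() {
  file=$1
  shift
  gunzip -c "$file" | "$@" | wc -c | tr -d ' '
}

# Example usage (assumes pkg.tar.gz exists and the tools are installed):
#   recompressed_size pkg.tar.gz zstd -19
#   recompressed_size pkg.tar.gz xz -9
#   recompressed_size pkg.tar.gz bzip2 -9
```

Passing the compressor as a command keeps the helper independent of any one tool, so further methods can be compared without editing it.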
On Sat, 11 Jan 2025 16:05:46 -0800
Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:
It's probably also worth looking at package compression with 'xz' compression. In [1], Mike FC has a graph where 'bzip2' and 'xz' seem to give the best compression ratios, at least for RDS files.
'bzip2' can be surprisingly good on very repetitive payloads. It compresses 0x80000000 zero bytes to only 1.5 KiB, much better than 'xz -9' with 305 KiB (with compression settings not making much difference), although the compression is not perfect. One terabyte of zeros can be compressed to 697202 bytes of repetitive compressed stream that can be bzipped further to 248 bytes.

Binary packages are probably the most obvious target for new compression methods because there is no need to install them on older versions of R.
Best regards, Ivan
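The zero-byte comparison is easy to try on a smaller scale. This is a sketch, assuming head -c and /dev/zero are available and that bzip2 and xz are installed. zero_payload_size is a made-up name, and the 1 MiB payload in the usage comments is chosen only so it runs quickly, so the numbers will differ from those quoted above.

```shell
# Hypothetical helper: feed N zero bytes to the given compressor command
# and print the size of its output in bytes.
zero_payload_size() {
  n=$1
  shift
  head -c "$n" /dev/zero | "$@" | wc -c | tr -d ' '
}

# Example usage:
#   zero_payload_size 1048576 bzip2 -9
#   zero_payload_size 1048576 xz -9
```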
I think the first step would have to be to add zstd support to R. zstd is a bit controversial (as shown by the community blowback of the changes you mentioned) and their build system (calling it that is being very generous) is a mess, so it would require a bit of testing, but it is doable.

That said, assuming the above is solved, we have been debating the change of compression at CRAN in general for a bit, but the assumptions about the file names are built into today's tools, so there would certainly be some fall-out, not just in R but also in the ecosystems around it. As you pointed out, possibly the safest place to start is binaries, since we have tighter control of those and they are used in fewer places.

Personally, I think the higher priority is signing, so as we address that we may just include the compression change with it, since it will require some tool changes anyway. I was thinking of using xz as that is more stable, already supported and less controversial, but I don't think the choice really matters: it just has to be a compression which R supports (zstd and xz have different benefits, so it's always a trade-off without a clear winner).

Cheers,
Simon
2 days later
With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional. The error comes in building the datasets package, see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112. Best wishes, Heather
Heather, thanks, now fixed (datasets was using numeric value for compress= instead of the compression name so it picked zstd instead of gzip - now the switch order is kept the same). Cheers, Simon