Package compression benchmarks for zstd vs gzip

7 messages · Jeroen Ooms, Avraham Adler, Henrik Bengtsson +3 more

#
Many distros and browsers these days use zstd as the preferred
compression method. For example, if you unpack a .deb or .rpm file on
Debian or Fedora, there is a zstd archive inside. It is claimed that zstd
offers improved compression over gzip, but (unlike lzma) it has
comparable decompression speed. Maybe it is interesting to get an
estimate of how much R packages would benefit from zstd.

Testing this for source packages and macOS binary packages is easy,
as we can gunzip and recompress the tar.gz files without having to
extract the tarball itself:

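  OUTPUT="sizes.txt"
  echo "FILE GZIP ZSTD" > "$OUTPUT"
  for x in *gz; do
    FILE=$(basename "$x")
    # Size in bytes of the existing gzip archive
    GZIP=$(wc -c < "$x" | awk '{print $1}')
    # Size in bytes after recompressing the same tar stream with zstd -19
    ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c | awk '{print $1}')
    echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
  done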

Attached are the results of running this script on the 500 most
downloaded CRAN packages. They show a size reduction of about 16% for
sources and 19% for binaries.
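
The percentages can be recomputed from the script's output. Below is a
minimal sketch, assuming sizes.txt has the header line and the three
whitespace-separated columns written by the loop above. It reports the
reduction in total bytes across all packages; the thread does not say
how the quoted figures were aggregated, and a per-package mean would
give a different number. The function name summarize_sizes is made up
for this illustration.

```shell
# Summarize a "FILE GZIP ZSTD" table as written by the loop above.
# Assumption: the reduction is measured on total bytes over all packages.
summarize_sizes() {
  awk 'NR > 1 { gz += $2; zs += $3 }   # skip the header line
       END {
         if (gz == 0) { print "no data"; exit 1 }
         printf "gzip=%.0f zstd=%.0f reduction=%.1f%%\n", gz, zs, 100 * (1 - zs / gz)
       }' "$1"
}

# Usage: summarize_sizes sizes.txt
```

Summing bytes weights large packages more heavily, which is the
relevant measure for mirror storage and bandwidth.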

Zstd is BSD licensed C code that can easily be embedded in any project.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: sources.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment.txt>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: binaries.txt
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment-0001.txt>
1 day later
#
zstd is accessible within R using the archive package [1]. I use it
all the time when saving large objects, using code I adapted from [2].
Is your suggestion to import the libraries/source code into base?

[1] https://CRAN.R-project.org/package=archive
[2] https://coolbutuseless.github.io/2018/10/02/using-lz4-and-zstandard-to-compress-files-with-saverds/
On Fri, Jan 10, 2025 at 6:17 PM Jeroen Ooms <jeroenooms at gmail.com> wrote:
#
Can't speak for Jeroen, but it sounds like it's worth adding support
for tar.zstd package files, just like how tar.gz, tar.xz, and
tar.bzip2 are currently supported. I'd also argue for supporting zstd
compression throughout R, including adding zstdfile(), support for
saveRDS(..., compress = "zstd"), and so on. Then it could be discussed
later what the default(s) should be.

It's probably also worth looking at package compression with 'xz'
compression. In [1], Mike FC has a graph where 'bzip2' and 'xz' seem
to give the best compression ratios, at least for RDS files.

FWIW, Mike FC submitted the 'zstdlite' package [1] to CRAN about a
year ago, but it was removed, resubmitted, then removed again. I
believe this was Mike FC's first-ever CRAN submission, but I think they
eventually gave up. From
https://cran.r-project.org/src/contrib/PACKAGES.in:

Package: zstdlite
X-CRAN-Comment: Removed on 2024-03-18 for repeated policy violation.
  .
  Does not look for suitable system 'libzstd'.
  Spams personal email addresses of team members.
X-CRAN-History: Removed on 2024-03-13 for policy violation and
misrepresentation of copyright holder(s).
  .
  Does not even attempt to use system 'libzstd'.
  Back on CRAN on 2024-03-17.

[1] https://github.com/coolbutuseless/zstdlite

/Henrik
On Sat, Jan 11, 2025 at 3:41 PM Avraham Adler <avraham.adler at gmail.com> wrote:
#
On Sat, 11 Jan 2025 16:05:46 -0800
Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:

            
'bzip2' can be surprisingly good on very repetitive payloads. It
compresses 0x80000000 zero bytes to only 1.5 KiB, much better than 'xz
-9' with 305 KiB (with compression settings not making much
difference), although the compression is not perfect. One terabyte of
zeros can be compressed to 697202 bytes of repetitive compressed stream
that can be bzipped further to 248 bytes.

Binary packages are probably the most obvious target for new
compression methods because there is no need to install them on older
versions of R.
#
I think the first step would have to be to add zstd support to R. zstd is a bit controversial (as shown by the community blowback of the changes you mentioned) and their build system (calling it that is being very generous) is a mess, so it would require a bit of testing, but it is doable.

That said, assuming the above is solved, we have been debating the change of compression at CRAN in general for a bit, but the assumptions about the file names are built into today's tools, so there would certainly be some fallout, not just in R but also in the ecosystems around it. As you pointed out, possibly the safest place to start is binaries, since we have tighter control of those and they are used in fewer places.

Personally, I think the higher priority is signing, so as we address that we may just include the compression change with it, since it will require some tool changes anyway. I was thinking of using xz as that is more stable, already supported and less controversial, but I don't think the choice really matters: it just has to be a compression which R supports (zstd and xz have different benefits, so it's always a trade-off without a clear winner).

Cheers,
Simon
2 days later
#
With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional.

The error comes in building the datasets package, see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112.

Best wishes,

Heather
On Mon, Jan 13, 2025, at 1:26 AM, Simon Urbanek wrote:
#
Heather,

thanks, now fixed (datasets was using a numeric value for compress= instead of the compression name, so it picked zstd instead of gzip; the switch order is now kept the same).

Cheers,
Simon