
Request to speed up save()

5 messages · Stewart Morris, Brian Ripley, Dénes Tóth, and 2 others

#
Hi,

I am dealing with very large datasets and it takes a long time to save a 
workspace image.

The options to save compressed data are: "gzip", "bzip2" or "xz", the 
default being gzip. I wonder if it's possible to include the pbzip2 
(http://compression.ca/pbzip2/) algorithm as an option when saving.

"PBZIP2 is a parallel implementation of the bzip2 block-sorting file 
compressor that uses pthreads and achieves near-linear speedup on SMP 
machines. The output of this version is fully compatible with bzip2 
v1.0.2 or newer"

I tested this as follows with one of my smaller datasets, having only 
read in the raw data:

============
# Dumped an ascii image
save.image(file='test', ascii=TRUE)

# At the shell prompt:
ls -l test
-rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test

time bzip2 -9 test
364.702u 3.148s 6:14.01 98.3%	0+0k 48+1273976io 1pf+0w

time pbzip2 -9 test
422.080u 18.708s 0:11.49 3836.2%	0+0k 0+1274176io 0pf+0w
============

As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took 
11 seconds, admittedly on a 64 core machine (running at 50% load). Most 
modern machines are multicore so everyone would get some speedup.

Is this feasible/practical? I am not a developer so I'm afraid this 
would be down to someone else...

Thoughts?

Cheers,

Stewart
#
On 15/01/2015 12:45, Stewart Morris wrote:
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.

Sounds like bad practice on your part ... saving images is not 
recommended for careful work.

> I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.

It is not an 'algorithm', it is a command-line utility widely available 
for Linux at least.

> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)

Why do that if you are at all interested in speed?  A pointless (and 
inaccurate) binary to decimal conversion is needed.

> time bzip2 -9 test
> time pbzip2 -9 test

But R does not by default save bzip2-ed ASCII images ... and gzip is the 
default because its speed/compression tradeoffs (see ?save) are best for 
the typical R user.

> Most modern machines are multicore so everyone would get some speedup.

And your last point is a common misunderstanding, that people typically 
have lots of spare cores which are zero-price.  Even on my 8 (virtual) 
core desktop when I typically do have spare cores, using them has a 
price in throttling turbo mode and cache contention.  Quite a large 
price: an R session may run 1.5-2x slower if 7 other tasks are run in 
parallel.

> Is this feasible/practical? I am not a developer so I'm afraid this
> would be down to someone else...

Not in base R.  For example one would need a linkable library, which the 
site you quote is not obviously providing.

Nothing is stopping you writing a sensible uncompressed image and 
optionally compressing it externally, but note that for some file 
systems compressed saves are faster because of reduced I/O.
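
One way to do exactly that from within R is to write the image through a 
pipe() connection; a rough, untested sketch, assuming pbzip2 is installed 
and on the PATH (the command line and file name are only illustrative):

============
## Write an uncompressed image through pbzip2; save()'s compression options
## are not applied when writing to a connection, so pbzip2 sees the raw
## serialized stream and produces ordinary bzip2-compatible output.
con <- pipe("pbzip2 -9 > image.RData.bz2", "wb")
save(list = ls(all.names = TRUE), file = con)
close(con)

## Read it back through pbzip2 as well:
con <- pipe("pbzip2 -dc image.RData.bz2", "rb")
load(con)
close(con)
============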

#
On 01/15/2015 01:45 PM, Stewart Morris wrote:
Take a look at the gdsfmt package. It supports the superfast LZ4 
compression algorithm, and it provides highly optimized functions to 
write to/read from disk.
https://github.com/zhengxwen/gdsfmt
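
For anyone unfamiliar with the package, a minimal sketch of that kind of 
usage (function names are from gdsfmt; the "LZ4_RA" codec name and the 
other details are illustrative and depend on the package version):

============
library(gdsfmt)

## Create a GDS file and store a large vector with LZ4 compression
f <- createfn.gds("test.gds")
add.gdsn(f, "x", val = rnorm(1e7), compress = "LZ4_RA", closezip = TRUE)
closefn.gds(f)

## Read it back
f <- openfn.gds("test.gds")
x <- read.gdsn(index.gdsn(f, "x"))
closefn.gds(f)
============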
#
In addition to the major points that others made: if you care about speed, don't use compression. With today's fast disks, using compression is an order of magnitude slower:

compressed save:
user  system elapsed 
 17.210   0.148  17.397
uncompressed save:
user  system elapsed 
  0.482   0.355   0.929 

The above example is intentionally well compressible; in real life the differences are even bigger. As people who deal with big data know well, disks are no longer the bottleneck - the CPU is.

Cheers,
Simon

BTW: why in the world would you use ascii=TRUE? It's pretty much the slowest possible serialization you can use - it will even overshadow compression:

uncompressed binary save:
user  system elapsed 
  0.459   0.383   0.940
ascii save:
user  system elapsed 
 36.713   0.140  36.929 

and the same goes for reading:

ascii load:
user  system elapsed 
 27.616   0.275  27.948
binary load:
user  system elapsed 
  0.609   0.184   0.795
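
The calls behind these timings are not shown above, but the same kind of 
comparison can be reproduced along these lines (a rough sketch; the 
synthetic data below is only an assumption, the object actually timed is 
not shown, so absolute numbers will differ):

============
## Synthetic, highly compressible data: 1e8 integers, roughly 400 MB serialized
d <- lapply(1:10, function(i) as.integer(rnorm(1e7)))

## Compressed (gzip, the default) vs. uncompressed binary serialization
system.time(saveRDS(d, "test_gz.rds"))                 # compressed
system.time(saveRDS(d, "test.rds", compress = FALSE))  # uncompressed

## Binary vs. ASCII workspace saves, both uncompressed
system.time(save(d, file = "test_bin.RData", compress = FALSE))
system.time(save(d, file = "test_ascii.RData", ascii = TRUE, compress = FALSE))

## And reading back
system.time(readRDS("test.rds"))
system.time(load("test_ascii.RData"))
============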
#
On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek
<simon.urbanek@r-project.org> wrote:
> if you care about speed, don't use compression. With today's fast
> disks, using compression is an order of magnitude slower

Respectfully, while your example would imply this, I don't think this
is correct in the general case. Much faster compression schemes
exist, and using them can improve disk I/O tremendously. Some
schemes are so fast that it is quicker to move compressed data from
main RAM to the CPU cache and decompress it there than to be limited
by RAM bandwidth moving the uncompressed data: https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can
actually be faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are bottlenecks. As
Stewart suggests, this can be mitigated by throwing more cores at the
problem. This isn't a bad solution, as there are often excess
underutilized cores. But it would be much better to choose a faster
compression scheme first, and then parallelize that across cores if
still necessary.

Sometimes the tradeoff is between amount of compression and speed, and
sometimes some algorithms are just faster than others.   Here's some
sample data for the test file that your example creates:
user  system elapsed
  0.554   0.336   0.890

nate@ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds
nate@ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate@ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate@ubuntu:~/R/rds$ time pigz -c test.rds > test.rds.gz
real: 2.180 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate@ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate@ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate@ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate@ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate@ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely, with single-threaded lz4
compression we're getting 3x compression at about 400MB/s, and
decompression at about 500MB/s. This is faster than almost any
single disk will be. Multithreaded implementations will make even the
fastest RAID the bottleneck.
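
None of this needs changes to base R to be usable today: as with the
pbzip2 suggestion earlier in the thread, an uncompressed serialization
can simply be streamed through whichever compressor wins on a given
machine. A rough sketch, assuming pigz (or a similar tool) is installed;
'obj' and the file name are placeholders:

============
## Stream an uncompressed serialization through a parallel gzip
con <- pipe("pigz -1 > test.rds.gz", "wb")
saveRDS(obj, con)   # 'obj' is whatever large object needs saving
close(con)

## And read it back the same way
con <- pipe("pigz -dc test.rds.gz", "rb")
obj2 <- readRDS(con)
close(con)
============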

It's probably worth noting that the speeds reported in your simple
example for the uncompressed case are likely the speed of writing to
memory, with the actual write to disk happening at some later time.
Sustained throughput will likely be slower than your example would
imply.

If saving data to disk is a bottleneck, I think Stewart is right that
there is a lot of room for improvement.

--nate