Request to speed up save()
On 15/01/2015 12:45, Stewart Morris wrote:
> Hi, I am dealing with very large datasets and it takes a long time to
> save a workspace image.
Sounds like bad practice on your part ... saving images is not recommended for careful work.
> The options to save compressed data are: "gzip", "bzip2" or "xz", the
> default being gzip. I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.
It is not an 'algorithm': it is a command-line utility, widely available at least for Linux.
> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file
> compressor that uses pthreads and achieves near-linear speedup on SMP
> machines. The output of this version is fully compatible with bzip2
> v1.0.2 or newer."
>
> I tested this as follows with one of my smaller datasets, having only
> read in the raw data:
>
> ============
> # Dumped an ASCII image
> save.image(file='test', ascii=TRUE)
Why do that if you are at all interested in speed? It requires a pointless (and inaccurate) binary-to-decimal conversion.
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3% 0+0k 48+1273976io 1pf+0w
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2% 0+0k 0+1274176io 0pf+0w
> ============
>
> As you can see, bzip2 on its own took over 6 minutes, whereas pbzip2
> took 11 seconds, admittedly on a 64-core machine (running at 50% load).
> Most modern machines are multicore, so everyone would get some speedup.
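[For readers unfamiliar with how pbzip2 gets this speedup while staying compatible with stock bzip2: it compresses fixed-size blocks of the input independently and concatenates the resulting streams, and a multi-stream file is still valid bzip2 input. A minimal sketch of that idea, using Python's stdlib `bz2` as a stand-in (block size and names are illustrative, not pbzip2's actual code):]

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def parallel_bzip2(data: bytes, block_size: int = 900_000) -> bytes:
    """Compress fixed-size blocks independently and concatenate the
    resulting bzip2 streams, as pbzip2 does."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool:          # one block per worker
        return b"".join(pool.map(bz2.compress, blocks))

payload = b"workspace image bytes " * 100_000   # ~2 MB of sample data
compressed = parallel_bzip2(payload)

# A concatenation of bzip2 streams is itself decodable by stock bzip2;
# Python's bz2.decompress likewise handles multi-stream input.
assert bz2.decompress(compressed) == payload
```

The per-block streams are why the quoted blurb can promise output "fully compatible with bzip2 v1.0.2 or newer": the decompressor just reads one stream after another.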
But R does not by default save bzip2-ed ASCII images ... and gzip is the default because its speed/compression tradeoffs (see ?save) are best for the typical R user.

Your last point reflects a common misunderstanding: that people typically have lots of spare cores which are zero-price. Even on my 8 (virtual) core desktop, when I typically do have spare cores, using them has a price in throttled turbo mode and cache contention. Quite a large price: an R session may run 1.5-2x slower if 7 other tasks are run in parallel.
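[A rough illustration of the speed/size tradeoff behind save()'s three choices, using Python's stdlib bindings for the same codecs rather than R itself; the sample data and sizes are illustrative only, and real timings will vary by machine and data:]

```python
import time, zlib, bz2, lzma

# Repetitive text compresses well under all three codecs.
data = b"x <- rnorm(1000); y <- 2 * x + 1\n" * 50_000

results = {}
for name, compress, decompress in [
        ("gzip",  zlib.compress, zlib.decompress),   # save()'s default
        ("bzip2", bz2.compress,  bz2.decompress),
        ("xz",    lzma.compress, lzma.decompress)]:
    t0 = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - t0
    assert decompress(out) == data                   # round-trip check
    results[name] = (len(out), elapsed)
    print(f"{name:5s} {len(out):9d} bytes  {elapsed:.3f}s")
```

Typically gzip is fastest, and bzip2/xz buy extra compression at a substantial CPU cost, which is the tradeoff ?save describes.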
> Is this feasible/practical? I am not a developer, so I'm afraid this
> would be down to someone else...
Not in base R. For example, one would need a linkable library, which the site you quote does not obviously provide. Nothing is stopping you from writing a sensible uncompressed image and optionally compressing it externally; but note that on some file systems compressed saves are faster because of the reduced I/O.
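[The "save uncompressed, compress externally" workflow can be sketched as below; stdlib `bz2` stands in for an external tool like pbzip2, and the file names are hypothetical (in R, the first step would be `save.image(..., compress = FALSE)`):]

```python
import bz2, os, shutil, tempfile

tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, "image.RData")           # hypothetical image file
with open(raw, "wb") as f:                       # stand-in for an uncompressed save
    f.write(b"pretend uncompressed workspace\n" * 10_000)

packed = raw + ".bz2"
with open(raw, "rb") as src, bz2.open(packed, "wb") as dst:
    shutil.copyfileobj(src, dst)                 # the external compression pass

print(os.path.getsize(raw), "->", os.path.getsize(packed))
```

Since load() transparently handles gzip/bzip2/xz-compressed files, an image compressed this way can still be read back into R directly.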
> Thoughts?
>
> Cheers,
> Stewart
-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK