Hi,

I am dealing with very large datasets and it takes a long time to save a workspace image.

The options to save compressed data are: "gzip", "bzip2" or "xz", the default being gzip. I wonder if it's possible to include the pbzip2 (http://compression.ca/pbzip2/) algorithm as an option when saving.

"PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer"

I tested this as follows with one of my smaller datasets, having only read in the raw data:

============
# Dumped an ascii image
save.image(file='test', ascii=TRUE)

# At the shell prompt:
ls -l test
-rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test

time bzip2 -9 test
364.702u 3.148s 6:14.01 98.3% 0+0k 48+1273976io 1pf+0w

time pbzip2 -9 test
422.080u 18.708s 0:11.49 3836.2% 0+0k 0+1274176io 0pf+0w
============

As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took 11 seconds, admittedly on a 64-core machine (running at 50% load). Most modern machines are multicore, so everyone would get some speedup.

Is this feasible/practical? I am not a developer so I'm afraid this would be down to someone else...

Thoughts?

Cheers,

Stewart

Stewart W. Morris
Centre for Genomic and Experimental Medicine
The University of Edinburgh
United Kingdom

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.