Benchmarking R, why sort() is so slow?

3 messages · Martin Maechler, Ross Ihaka, Peter Dalgaard

#

        
TL> On Fri, 27 Apr 2001, Philippe Grosjean wrote:
TL> >> I suspect it should be a question of algorithm choice. Maybe sort() is
TL> >> important enough, including for large datasets, to place some improvement
TL> >> in the "to do" list for a further version...?

TL> Actually, I'm not convinced on the speed issue. Shellsort works
TL> very well for vectors of lengths that R can reasonably
TL> handle. Using a million record dataset in R you really can't get
TL> much done in the 20 seconds that a really good sorting routine might
TL> save you.

TL> Heapsort might be worth trying -- it has good worst-case behaviour
TL> and is about half as fast as the typical case of quicksort.  Merge
TL> sort might also be nice because it's stable: it leaves ties in the
TL> order it found them, but it needs extra space.  Still, I don't think
TL> it's a high optimisation priority.

I tend to disagree.
For me, R is not just for data analysis.
It's my compute engine for almost everything, nowadays.
I definitely have wanted to add better sort() routines to R,
and I think R's default should be to use an O(n log(n)) version, at
least for largish n.
20 years ago (or so), ``the'' good thing 
 {{for in-memory sorting; note that we might consider a different algorithm 
   for the case of a really large array where [i] is really disk access}}
was Singleton's Quickersort
  |  R.C. Singleton 'Revised Quickersort'
  |  (CACM Algorithm 347, Sept.1968)

(quicker than quicksort), a version of which (one that additionally returns
 the sort index) I had re-programmed in Pascal and then C.
I can put it up for ftp if there's interest.

The GSL (GNU scientific library) has sorting routines, too,
all of them heapsort.
http://sources.redhat.com/gsl/ref/gsl-ref_17.html#SEC253
Their documentation seems to say one should really use  qsort()  whenever that
is available... hmm,
could we use  configure  to find if qsort() is available?

Martin

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
You may be overlooking a major reason for the slowness of the
R sorting --- the fact that the element comparisons are not inlined.
Comparisons dominate the cost of the sort, so any speedup of them
will make sorting faster.

It might be worth looking at replacing the present comparison
function definitions with macros to see what sort of speedup
you get.  It might also be useful to try to take advantage of
IEEE arithmetic behaviour, or perhaps do a pre-sweep of the
data to first place the NAs and Infs in their correct positions
at the start or the end, and so simplify the comparison code.
This might prove a bigger win than a switch of algorithm.

If you do plan on switching to Quicksort, you may want to read up
on tuning the algorithm.  One of John Bentley's "Programming Pearls"
chapters is on tuning Quicksort.  He does some simple things which
produce a sort 4 times faster than the "system sort".  I think
there is also some work by Bentley and Rick Becker on problems in
the Unix qsort implementation.

The choice to use Shellsort for R was based on advice in Sedgewick's
"Algorithms" book.  He says:

  "If one is not willing to invest the effort to be sure
  that a Quicksort implementation is not flawed, Shellsort
  is a much safer choice and will perform adequately for
  significantly less implementation effort."

Being the lazy sod that I am, I took this advice to heart.
#
Ross Ihaka <ihaka@stat.auckland.ac.nz> writes:
Hmm. I'd suspect that declaring them static would make gcc grab any
chance of inlining...
This sounds like it would be worth trying.