Skip to content

What "method" does sort() use?

4 messages · Peter Dalgaard, Patrick Connolly

#
I don't follow why this happens:
[1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E"

The help for sort() says:

  method: character string specifying the algorithm used.  Not
          available for partial sorting.  Can be abbreviated.

But what are the methods available?  The help mentions xtfrm but that
doesn't illuminate, I'd have thought that at least by default it would
have something to do with ASCII codes.  But that's not the case since
all the uppercase ones would be before the lowercase ones.

I know something different is happening but I don't know what it is
(do you, Mr Jones?).  Apologies to Bob Dylan.
#
On 18 Mar 2016, at 10:02 , Patrick Connolly <p_connolly at slingshot.co.nz> wrote:

            
Um, read _all_ of the help file?

sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE,
         method = c("shell", "quick"), index.return = FALSE)

[snip]

Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). If x has names a stable modification is used, so ties are not reordered. (This only matters if names are present.)

Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort method and is only available when x is numeric (double or integer) and partial is NULL. (For other types of x Shellsort is used, silently.) It is normally somewhat faster than Shellsort (perhaps 50% faster on vectors of length a million and twice as fast at a billion) but has poor performance in the rare worst case. (Peto's modification using a pseudo-random midpoint is used to make the worst case rarer.) This is not a stable sort, and ties may be reordered.

Factors with less than 100,000 levels are sorted by radix sorting when method is not supplied: see sort.list.

-pd
#
Ooops, that was answering the question you actually asked. The one you meant to ask is answered by this  part:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison. 

...and collating sequences is a weird and woolly subject, where you cannot even be sure that locales of the same name on two different platforms sort strings in the same order.

-pd
On 18 Mar 2016, at 10:13 , peter dalgaard <pdalgd at gmail.com> wrote:

            

  
    
1 day later
#
I did look at the Comparison help but totally overlooked this part;

     Comparison of strings in character vectors is lexicographic within
     the strings using the collating sequence of the locale in use: see
     ?locales?.  The collating sequence of locales such as ?en_US? is
     normally different from ?C? (which should use ASCII) and can be
     surprising.  

I've recently changed to a different Linux distribution and was trying
to work out why I was getting a different order of the factor levels
even though it was the same code.  I thought I'd inadvertantly changed
something somehow.

Much clearer now -- even if still confusing.  

Thanks for the pointer.
On Fri, 18-Mar-2016 at 10:19AM +0100, peter dalgaard wrote:
|> Ooops, that was answering the question you actually asked. The one you meant to ask is answered by this  part:
|> 
|> The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison. 
|> 
|> ...and collating sequences is a weird and woolly subject, where you cannot even be sure that locales of the same name on two different platforms sort strings in the same order.
|> 
|> -pd
|> 
|> 
|>
|> On 18 Mar 2016, at 10:13 , peter dalgaard <pdalgd at gmail.com> wrote:
|> 
|> >
|> > On 18 Mar 2016, at 10:02 , Patrick Connolly <p_connolly at slingshot.co.nz> wrote:
|> > 
|> >> I don't follow why this happens:
|> >> 
|> >>> sort(c(LETTERS[1:5], letters[1:5]))
|> >> [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E"
|> >> 
|> >> The help for sort() says:
|> >> 
|> >> method: character string specifying the algorithm used.  Not
|> >>         available for partial sorting.  Can be abbreviated.
|> >> 
|> >> But what are the methods available?  The help mentions xtfrm but that
|> >> doesn't illuminate, I'd have thought that at least by default it would
|> >> have something to do with ASCII codes.  But that's not the case since
|> >> all the uppercase ones would be before the lowercase ones.
|> >> 
|> >> I know something different is happening but I don't know what it is
|> >> (do you, Mr Jones?).  Apologies to Bob Dylan.
|> >> 
|> > 
|> > 
|> > Um, read _all_ of the help file?
|> > 
|> > sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE,
|> >         method = c("shell", "quick"), index.return = FALSE)
|> > 
|> > [snip]
|> > 
|> > Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). If x has names a stable modification is used, so ties are not reordered. (This only matters if names are present.)
|> > 
|> > Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort method and is only available when x is numeric (double or integer) and partial is NULL. (For other types of x Shellsort is used, silently.) It is normally somewhat faster than Shellsort (perhaps 50% faster on vectors of length a million and twice as fast at a billion) but has poor performance in the rare worst case. (Peto's modification using a pseudo-random midpoint is used to make the worst case rarer.) This is not a stable sort, and ties may be reordered.
|> > 
|> > Factors with less than 100,000 levels are sorted by radix sorting when method is not supplied: see sort.list.
|> > 
|> > -pd
|> > 
|> > 
|> > -- 
|> > Peter Dalgaard, Professor,
|> > Center for Statistics, Copenhagen Business School
|> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
|> > Phone: (+45)38153501
|> > Office: A 4.23
|> > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
|> 
|> -- 
|> Peter Dalgaard, Professor,
|> Center for Statistics, Copenhagen Business School
|> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
|> Phone: (+45)38153501
|> Office: A 4.23
|> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
|>