
split() - unexpected sorting of results

6 messages · Iñaki Ucar, Peter Meissner, Hervé Pagès +1 more

#
Hey,

I found this - for me - quite surprising and puzzling behaviour of split().


split(1:11, as.character(1:11))
split(1:11, 1:11)


When splitting by numerics everything works as expected - sorting of input
== sorting of output -- but when using a character vector everything gets
re-sorted alphabetically.


Although there are some references in the help files to what happens when
using split, I did not find any note on this - for me - rather unexpected
behaviour.


I would like it best if the sorting of split results stayed the same no
matter the input (sorting of input == sorting of output).

If that is not possible, a note of caution in the help pages and maybe an
example would be valuable.


Best, Peter
#
Hi Peter,

2017-10-20 21:33 GMT+02:00 Peter Meissner <retep.meissner at gmail.com>:
As the documentation states,

       f: a 'factor' in the sense that 'as.factor(f)' defines the
          grouping, or a list of such factors in which case their
          interaction is used for the grouping.

And, in fact,

> as.factor(1:11)
[1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 2 3 4 5 6 7 8 9 10 11
> as.factor(as.character(1:11))
[1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 10 11 2 3 4 5 6 7 8 9
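
To tie the quoted documentation back to the original example, here is a minimal sketch (the variable names are only illustrative) showing that the names of a split() result follow the levels of as.factor(f):

```r
# Illustration only: split() names its groups by levels(as.factor(f)).
x <- 1:11

by_int <- split(x, x)                 # integer grouping vector
by_chr <- split(x, as.character(x))   # character grouping vector

# Integer levels are ordered numerically.
names(by_int)

# Character levels are ordered as character strings, so "10" and "11"
# come before "2" (the exact collation depends on the locale).
names(by_chr)

# In both cases the group names are exactly the factor levels.
identical(names(by_int), levels(as.factor(x)))
identical(names(by_chr), levels(as.factor(as.character(x))))
```

The grouping itself is the same in both calls; only the order in which the groups are listed differs.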

Regards,
Iñaki
#
Thanks for the explanation.

Still, I think this is surprising behaviour which might be handled better.

Best, Peter

On 20.10.2017 at 9:49 PM, "Iñaki Úcar" <i.ucar86 at gmail.com> wrote:

  
  
#
Hi,
On 10/20/2017 12:53 PM, Peter Meissner wrote:
Maybe a little surprising, but no more than:

 > x <- sample(11L)

 > sort(x)
  [1]  1  2  3  4  5  6  7  8  9 10 11

 > sort(as.character(x))
  [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

The fact that sort(), as.factor(), split() and many other things behave
consistently with respect to the underlying order of character vectors
avoids other even bigger surprises.

Also note that the underlying order of character vectors actually
depends on your locale. One way to guarantee consistent results across
platforms/locales is by explicitly specifying the levels when making
a factor e.g.

   f <- factor(x, levels=unique(x))
   split(1:11, f)

This is particularly sensible when writing unit tests.
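
Applied to the character case from the original post, the same idea might look like the sketch below. The names keys, f and res are only illustrative; the assumption is that "input order" means the order in which each key first appears.

```r
# Sketch: keep groups in order of first appearance by fixing the levels.
keys <- as.character(1:11)

# unique() preserves first-appearance order, so no sorting is involved.
f <- factor(keys, levels = unique(keys))
res <- split(1:11, f)

# Group names now follow the input order rather than the collation order.
names(res)
identical(names(res), keys)
```

Because the levels are given explicitly, the result no longer depends on how the locale sorts character strings.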

Cheers,
H.

  
    
#
Hello,

To sort numbers that have been converted to character in numeric order,
there is the stringr package with its functions str_sort and str_order.

library(stringr)

set.seed(2447)

x <- sample(11L)
sort(as.character(x))
#[1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

str_sort(as.character(x), numeric = TRUE)
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

str_order(as.character(x), numeric = TRUE)
#[1]  1  4 11  8  6  5  3 10  9  7  2

i <- str_order(as.character(x), numeric = TRUE)
as.character(x)[i]
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"


Unfortunately, this does not solve the OP's question: factor(),
as.factor(), split() and others use the base R sorter by default, and
that default can only be changed by changing their sources.
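
Without touching any sources, a per-call workaround is still possible by combining this with the explicit-levels suggestion made earlier in the thread. The sketch below uses base R only and assumes every key can be parsed as a number; keys, lv and res are illustrative names.

```r
# Sketch: numeric ordering of character keys via explicit factor levels.
# Assumption: all keys convert cleanly with as.numeric().
keys <- as.character(sample(11L))

lv <- unique(keys)
lv <- lv[order(as.numeric(lv))]   # order the levels by numeric value

res <- split(seq_along(keys), factor(keys, levels = lv))

# Groups are now listed as "1", "2", ..., "11".
names(res)
```

This changes the ordering for one call only; the default behaviour of factor() and split() is unaffected.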

Hope this helps,

Rui Barradas

On 21-10-2017 at 00:32, Hervé Pagès wrote:
1 day later
#
Thank you all for your input - most appreciated.

Best, Peter

On 21.10.2017 at 07:35, "Rui Barradas" <ruipbarradas at sapo.pt> wrote: