
Sorting strings

16 messages · statquant2, Keith Jewell, Enrico Schumann +5 more

#
Hi all, I am having difficulty understanding how R sorts strings:

If I do
R) sort(c("X.","X0B"))
[1] "X."  "X0B"

As I understand lexicographic order, appending the same suffix to each
string should leave the order unchanged, but:
R) sort(c("X.Z","X0B.Z"))
[1] "X0B.Z" "X.Z"

Can somebody suggest a trick to make the order lexicographic?



--
View this message in context: http://r.789695.n4.nabble.com/Sorting-strings-tp4403696p4403696.html
Sent from the R help mailing list archive at Nabble.com.
#
On Mon, Feb 20, 2012 at 02:18:42AM -0800, statquant2 wrote:
Hi.

This need not be true for strings of different lengths.
For example, the sorted pair

  ab
  abc

becomes, after appending z to each string,

  abcz
  abz

so the order of the two strings is reversed.
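
The reversal can be checked directly in R. This is only a sketch, and it assumes R 3.3.0 or later, where sort() accepts method = "radix" and compares character vectors in the C locale whatever the session locale is. The variable names are illustrative.

```r
# Two strings of different length, already in sorted order.
x <- c("ab", "abc")

# method = "radix" sorts character vectors in the C locale,
# so the result does not depend on the session's LC_COLLATE.
before <- sort(x, method = "radix")

# Append the same suffix to both strings and sort again.
after <- sort(paste0(x, "z"), method = "radix")

print(before)
print(after)
```

After the suffix is appended, the comparison is decided at the third character, "c" against "z", so "abcz" sorts first.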

Petr Savicky.
#
"Petr Savicky" <savicky at cs.cas.cz> wrote in message 
news:20120220105153.GC21422 at cs.cas.cz...
That's not the explanation in this case.

The OP isn't telling us everything.
I get [R version 2.14.1 Platform: i386-pc-mingw32/i386 (32-bit)]:
[1] "X."  "X0B"
[1] "X.Z"   "X0B.Z"

KJ
#
See ?Comparison, which holds some warnings about what to expect when 
sorting strings.


On 20.02.2012 11:51, Petr Savicky wrote:

#
I don't *think* it's version specific, but rather it depends on your
(still unstated) locale, as the documentation goes to great lengths to
point out. Change that and you might see different behaviors.

Michael
On Mon, Feb 20, 2012 at 8:55 AM, statquant2 <statquant at gmail.com> wrote:
#
Hello,


statquant2 wrote
I don't know about 2.12.2 but for 2.12.0 I get:
_                            
platform       i386-pc-mingw32              
arch           i386                         
os             mingw32                      
system         i386, mingw32                
status                                      
major          2                            
minor          12.0                         
year           2010                         
month          10                           
day            15                           
svn rev        53317                        
language       R                            
version.string R version 2.12.0 (2010-10-15)
[1] "X."  "X0B"
[1] "X.Z"   "X0B.Z"

And the same for 2.14.1:
_                            
platform       i386-pc-mingw32
[... deleted...]
version.string R version 2.14.1 (2011-12-22)
[1] "X."  "X0B"
[1] "X.Z"   "X0B.Z"

Could it be OS related?

Rui Barradas.

#
OK, I have:

R) str(R.Version())
List of 13
 $ platform      : chr "x86_64-unknown-linux-gnu"
 $ arch          : chr "x86_64"
 $ os            : chr "linux-gnu"
 $ system        : chr "x86_64, linux-gnu"
 $ status        : chr ""
 $ major         : chr "2"
 $ minor         : chr "12.2"
 $ year          : chr "2011"
 $ month         : chr "02"
 $ day           : chr "25"
 $ svn rev       : chr "54585"
 $ language      : chr "R"
 $ version.string: chr "R version 2.12.2 (2011-02-25)"

R) sort(c("X.","X0B"))
[1] "X."  "X0B"
R) sort(c("X.Z","X0B.Z"))
[1] "X0B.Z" "X.Z"  

I am using Red Hat Linux:
$ uname -a
Linux 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64
x86_64 GNU/Linux


#
On Mon, Feb 20, 2012 at 05:55:30AM -0800, statquant2 wrote:
Hi.

Try this

  Sys.setlocale("LC_COLLATE", "C") 


This comes from ?locale, which reads:

     Sys.setlocale("LC_COLLATE", "C")   # turn off locale-specific sorting,
                                        #  usually

See also ?sort

     The sort order for character vectors will depend on the collating
     sequence of the locale in use: see 'Comparison'.

?Comparison

     Comparison of strings in character vectors is lexicographic within
     the strings using the collating sequence of the locale in use: see
     'locales'.  The collating sequence of locales such as 'en_US' is
     normally different from 'C' (which should use ASCII) and can be
     surprising.  Beware of making _any_ assumptions about the
     collation order: ...
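
A minimal sketch of applying this to the strings from the original question. The variable names are illustrative, and the previous collation setting is saved and restored so that the rest of the session is unaffected.

```r
x <- c("X.Z", "X0B.Z")

# Remember the current collation setting so it can be restored.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch to the C locale, which compares strings by byte value.
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(x)

# Put the previous collation setting back.
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(sorted_c)
```

In the C locale "." (ASCII 46) sorts before "0" (ASCII 48), so "X.Z" comes first. That is the order reported on Windows elsewhere in this thread.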

Hope this helps.

Petr Savicky.
#
It seems to be OS-dependent. I got different results when trying it on
Windows XP and Red Hat Linux.


 > R.version
                _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          9.1
year           2009
month          06
day            26
svn rev        48839
language       R
version.string R version 2.9.1 (2009-06-26)
 > sort(c("X.","X0B"))
[1] "X."  "X0B"
 > sort(c("X.Z","X0B.Z"))
[1] "X.Z"   "X0B.Z"


 > R.version
                _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          9.1
year           2009
month          06
day            26
svn rev        48839
language       R
version.string R version 2.9.1 (2009-06-26)
 > sort(c("X.","X0B"))
[1] "X."  "X0B"
 > sort(c("X.Z","X0B.Z"))
[1] "X0B.Z" "X.Z"
On 2012-2-20 23:27, statquant2 wrote:
#
On Mon, Feb 20, 2012 at 04:56:21PM +0100, Petr Savicky wrote:
This is not in ?locale, but in ?locales, in the Examples section at the end.

Also try checking the current settings with

  Sys.getlocale()

LC_CTYPE can also be relevant:

  Sys.setlocale("LC_CTYPE", "C")
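
To compare machines, it may help to print just the two categories mentioned above. A small sketch (the name locale_settings is illustrative):

```r
# Query the categories that affect string comparison and character handling.
categories <- c("LC_COLLATE", "LC_CTYPE")
locale_settings <- vapply(categories, Sys.getlocale, character(1))

# A named character vector, one entry per category.
print(locale_settings)
```

Running this on both platforms should show whether the collation settings differ.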

Hope this helps.

Petr Savicky.
#
Sorry, I made a mistake. This is the result from Windows XP.

 > sort(c("X.","X0B"))
[1] "X."  "X0B"
 > sort(c("X.Z","X0B.Z"))
[1] "X.Z"   "X0B.Z"
 > R.version
                _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          13.0
year           2011
month          04
day            13
svn rev        55427
language       R
version.string R version 2.13.0 (2011-04-13)
On 2012-2-21 0:13, De-Jian Zhao wrote:
#
On 2012-2-20 23:15, Rui Barradas wrote:
Yes, it seems so. I tried it on my local Windows XP machine and on a
Red Hat Linux server and got different results. I hope this will be
fixed in future versions. In the meantime, we should check that results
are consistent when moving code from one platform to another.


 > sort(c("X.","X0B"))
[1] "X."  "X0B"
 > sort(c("X.Z","X0B.Z"))
[1] "X0B.Z" "X.Z"
 > R.version
                _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          9.1
year           2009
month          06
day            26
svn rev        48839
language       R
version.string R version 2.9.1 (2009-06-26)



 > sort(c("X.","X0B"))
[1] "X."  "X0B"
 > sort(c("X.Z","X0B.Z"))
[1] "X.Z"   "X0B.Z"
 > R.version
                _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          13.0
year           2011
month          04
day            13
svn rev        55427
language       R
version.string R version 2.13.0 (2011-04-13)
#
On 20-Feb-2012 Petr Savicky wrote:
I've been following this thread with interest. I had begun composing
a reply on similar lines to Petr's above, but put it on one side
while waiting to see how the thread would evolve.

In view of the tangle of mixed experiences reported by different
users, I now wonder whether we should have something like "lc_collate"
as a specific parameter for sort(), e.g. so that one can set, for a
particular sorting operation,

   sort(c("X.","X0B"),lc_collate="C")

without affecting the system "LC_COLLATE" setting (i.e. the change
takes effect only within the execution of that sort() command).
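
Until something like that exists, the same effect can be approximated with a small wrapper. This is only a sketch: sort() has no lc_collate argument, and sort_with_collate is a hypothetical helper, not part of R. It sets LC_COLLATE for the duration of the call and uses on.exit() to restore the previous value.

```r
# Hypothetical helper: sort x under a given collation locale,
# restoring the session's previous LC_COLLATE afterwards.
sort_with_collate <- function(x, lc_collate = "C", ...) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the old setting even if sort() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", lc_collate)
  sort(x, ...)
}

collate_before <- Sys.getlocale("LC_COLLATE")
result <- sort_with_collate(c("X.Z", "X0B.Z"))
collate_after <- Sys.getlocale("LC_COLLATE")

print(result)
```

The session's collation setting is the same before and after the call, so other code is not affected.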

Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 20-Feb-2012  Time: 17:16:47
This message was sent by XFMail