
invalid regular expression '[a-Z]'

6 messages · Duncan Murdoch, Henrik Bengtsson, Brian Ripley

#
Hi,

just curious, but does anyone know the source/reason of observing the
following error on OSX but not on WinXP and Linux?  I've tried with a
few different versions of R (v2.5.1, v2.6.1, v2.6.2, v2.7.0devel).
The locale does not seem to affect the error, i.e. I've tested a few
different ones and it is still only OSX that gives the error, not the
other two.
Error in regexpr(pattern, text, extended, fixed, useBytes) :
        invalid regular expression '[a-Z]'
[1] 1
attr(,"match.length")
[1] 1
[1] 1
attr(,"match.length")
[1] 1

At least now I know that the safest is to use '[a-zA-Z]' (or
possibly '[[:alpha:]]').

/Henrik
#
On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:
Presumably in the locale you're using on OSX, "a" < "Z" is false.  This 
is the ascii sort order used in the C locale.  On my Windows box, "a" < 
"Z" is true, because it uses the English_Canada.1252 collation order.
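A minimal sketch of that comparison, assuming a session where LC_COLLATE can be switched to "C"; the variable names are illustrative only and not from the original post:

```r
# Compare "a" and "Z" under the C collation order, then restore the
# previous setting so the rest of the session is unaffected.
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")

# In the C locale strings are compared byte by byte, and "Z" (90)
# comes before "a" (97) in ASCII.
a_before_Z_in_C <- "a" < "Z"

Sys.setlocale("LC_COLLATE", old)
print(a_before_Z_in_C)   # FALSE
```

Running the same comparison without the switch shows what the session's own collation order says; that result depends on the platform and locale.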

Duncan Murdoch

#
On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
That's it indeed.  The person who first reported the error had
sessionInfo() locale
'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
missed that 'C' in the middle, which I guess his system falls back to
if none of the previous ones exist?!?

Now I can reproduce it on both Windows and Linux:
[1] "C"
Error in regexpr("[a-Z]", "foo") : invalid regular expression '[a-Z]'
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
[1] 1
attr(,"match.length")
[1] 1

Case almost closed, but then the question is: with the other
locale(s), why isn't there an error in one of the two cases, '[a-Z]'
or '[A-z]'?
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
[1] 1
attr(,"match.length")
[1] 1
[1] 1
attr(,"match.length")
[1] 1
[1] TRUE
[1] FALSE

Thanks

/Henrik
#
On Wed, Mar 5, 2008 at 6:40 PM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
My bad...
[1] TRUE
[1] FALSE
Error in regexpr("[z-A]", "foo") : invalid regular expression '[z-A]'

Case closed
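For anyone wanting to check which bracket ranges their own setup accepts, here is a small probe. It is only a sketch: range_ok is a hypothetical helper, not part of R, and whether '[a-Z]' or '[A-z]' is accepted depends on the regex engine and collation order in use, as discussed above.

```r
# Hypothetical helper: TRUE if the pattern compiles, FALSE if regexpr()
# rejects it as an invalid regular expression.
range_ok <- function(pattern) {
  tryCatch({
    regexpr(pattern, "foo")
    TRUE
  }, error = function(e) FALSE)
}

# A range is only valid when its start does not sort after its end.
print(range_ok("[a-z]"))
print(range_ok("[z-a]"))

# These two are the platform-dependent cases from this thread.
print(c("[a-Z]" = range_ok("[a-Z]"), "[A-z]" = range_ok("[A-z]")))
```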

/Henrik
#
On Wed, 5 Mar 2008, Henrik Bengtsson wrote:

            
No.  Those are settings for various categories, just as you showed for 
Windows.  The first setting appears to be LC_COLLATE, but what they mean is 
not documented on the system man page for setlocale.

It's just that MacOS uses C collation order in English locales, even 
though almost everyone else uses aAbB or AaBb (the latter being what the 
English actually use, as do almost all book indices in dialects of 
English).  But then there is no surprise that MacOS has to be different 
... its implementation of locales is idiosyncratic (to be generous).

Note that even [A-Za-z] is unsafe -- as I recall Z is in the middle of the 
alphabet in Estonian locales.  If you want alphabetic characters, use 
[[:alpha:]].  If you want ASCII alphabetic characters, write out the 
ranges as [AB...Zab...z]

E.g. (F8 Linux)
[1] "et_EE.utf8"
[1] "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsZzTtUuVvWwXxYy"
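A rough sketch of both points, for trying on your own system. collation_order, ascii_alpha and is_ascii_letter are illustrative names, not existing R functions, and a locale name such as "et_EE.utf8" is only usable where the OS has it installed.

```r
# Hypothetical helper: the order of the 52 ASCII letters under a given
# LC_COLLATE setting, or NA if the locale cannot be set.
collation_order <- function(locale) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  if (!nzchar(suppressWarnings(Sys.setlocale("LC_COLLATE", locale)))) {
    return(NA_character_)
  }
  paste(sort(c(LETTERS, letters)), collapse = "")
}

print(collation_order("C"))            # ASCII order: A..Z then a..z
print(collation_order("et_EE.utf8"))   # NA unless that locale is installed

# A bracket expression that lists every ASCII letter explicitly, so no
# range (and hence no collation order) is involved.
ascii_alpha <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")
is_ascii_letter <- function(x) grepl(ascii_alpha, x)

print(is_ascii_letter(c("foo", "123", "Z9")))
```

The explicit list is longer than '[A-Za-z]' but does not depend on where "Z" falls in the locale's alphabet.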


[...]
#
On Wed, Mar 5, 2008 at 11:09 PM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
Alpha and Omega - you said it all.

Thanks for the clarifications.

/Henrik