invalid regular expression '[a-Z]'

6 messages · Duncan Murdoch, Henrik Bengtsson, Brian Ripley

Henrik Bengtsson

Wed, Mar 5, 2008 5:56 PM #

Hi,

just curious, but does anyone know the source/reason of observing the
following error on OSX but not on WinXP and Linux?  I've tried with a
few different versions of R (v2.5.1, v2.6.1, v2.6.2, v2.7.0devel).
The locale does not seem to affect the error, i.e. I've tested a few
different and it is still only OSX that gives the error but not the
other two.

Error in regexpr(pattern, text, extended, fixed, useBytes) :
        invalid regular expression '[a-Z]'

[1] 1
attr(,"match.length")
[1] 1

[1] 1
attr(,"match.length")
[1] 1

At least now I know that the safest is to use '[a-zA-Z]' (or
possibly '[[:alpha:]]').

/Henrik

Wed, Mar 5, 2008 6:18 PM #

On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:

Presumably in the locale you're using on OSX, "a" < "Z" is false.  This 
is the ascii sort order used in the C locale.  On my Windows box, "a" < 
"Z" is true, because it uses the English_Canada.1252 collation order.
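
A minimal sketch of how one might check this, assuming only base R. The helper with_collate() is a hypothetical name, not part of R; it switches LC_COLLATE temporarily and restores the previous setting afterwards.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with the
# collation locale temporarily set to `locale`, then restore the old one.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  force(expr)  # evaluated lazily here, i.e. after the locale switch
}

lower <- "a"
upper <- "Z"

# In the C locale strings compare by byte value: "a" is 97, "Z" is 90,
# so this comparison is FALSE there.
in_c <- with_collate("C", lower < upper)

# Whatever the session's own collation order says (platform dependent).
in_session <- lower < upper

cat("C locale:       ", in_c, "\n")
cat("session locale: ", in_session, "\n")
```

If the two lines differ, a range such as '[a-Z]' can be expected to behave differently between that session and one running in the C locale.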

Duncan Murdoch

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Henrik Bengtsson

Wed, Mar 5, 2008 6:40 PM #

On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:

That's it indeed.  The person who first reported the error had
sessionInfo() locale
'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
missed that 'C' in the middle, which I guess his system falls back to
if none of the previous ones exist?!?

Now I can reproduce it on both Windows and Linux:

> Sys.setlocale("LC_ALL", "C")
[1] "C"

> regexpr("[a-Z]", "foo")
Error in regexpr("[a-Z]", "foo") : invalid regular expression '[a-Z]'

> Sys.setlocale("LC_ALL", "en")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;L
C_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States
.1252"

> regexpr("[a-Z]", "foo")
[1] 1
attr(,"match.length")
[1] 1

Case almost closed, but then the question is why don't you get an
error in one of the two cases '[a-Z]' and '[A-z]' then with the other
locale(s)?

> Sys.setlocale("LC_ALL", "en")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;L
C_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States
.1252"

> regexpr("[a-Z]", "foo")
[1] 1
attr(,"match.length")
[1] 1

> regexpr("[A-z]", "foo")
[1] 1
attr(,"match.length")
[1] 1

> "a" < "Z"
[1] TRUE

> "a" > "Z"
[1] FALSE

Thanks

/Henrik

Henrik Bengtsson

Wed, Mar 5, 2008 6:42 PM #

On Wed, Mar 5, 2008 at 6:40 PM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:

On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:

 > On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:

 >  > Hi,
 >  >
 >  > just curious, but does anyone know the source/reason of observing the
 >  > following error on OSX but not on WinXP and Linux?

 >
 >  Presumably in the locale you're using on OSX, "a" < "Z" is false.  This
 >  is the ascii sort order used in the C locale.  On my Windows box, "a" <
 >  "Z" is true, because it uses the English_Canada.1252 collation order.

 That's it indeed.  The person who first reported the error had
 sessionInfo() locale
 'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
 missed that 'C' in the middle, which I guess his system falls back to
 if none of the previous ones exist?!?

 Now I can reproduce it on both Windows and Linux:

 > Sys.setlocale("LC_ALL", "C")

 [1] "C"

 > regexpr("[a-Z]", "foo")

 Error in regexpr("[a-Z]", "foo") : invalid regular expression '[a-Z]'

 > Sys.setlocale("LC_ALL", "en")

 [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;L
 C_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States
 .1252"

 > regexpr("[a-Z]", "foo")

[1] 1
 attr(,"match.length")
 [1] 1

 Case almost closed, but then the question is why don't you get an
 error in one of the two cases '[a-Z]' and '[A-z]' then with the other
 locale(s)?

 > Sys.setlocale("LC_ALL", "en")

 [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;L
 C_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States
 .1252"

 > regexpr("[a-Z]", "foo")

[1] 1
 attr(,"match.length")
 [1] 1

 > regexpr("[A-z]", "foo")

 [1] 1
 attr(,"match.length")
 [1] 1

 > "a" < "Z"

 [1] TRUE

 > "a" > "Z"

 [1] FALSE

My bad...

[1] TRUE

[1] FALSE

Error in regexpr("[z-A]", "foo") : invalid regular expression '[z-A]'

Case closed
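
For patterns that have to run under unknown locales, one option is to test whether a pattern compiles before relying on it. A sketch, assuming base R only; is_valid_regex() is a hypothetical helper name, and the result for ranges such as '[a-Z]' may depend on the R version and locale.

```r
# Hypothetical helper (not part of base R): TRUE if `pattern` compiles as a
# regular expression in the current session, FALSE if regexpr() raises an error.
is_valid_regex <- function(pattern) {
  tryCatch({
    # Compilation problems may also be reported as warnings; silence those
    # and rely on the error to decide.
    suppressWarnings(regexpr(pattern, ""))
    TRUE
  }, error = function(e) FALSE)
}

patterns <- c("[a-z]", "[a-Z]", "[A-z]", "[z-A]")
valid <- vapply(patterns, is_valid_regex, logical(1))
print(valid)
```

A range whose end point sorts before its start point, as in '[z-A]', is the case that produces the "invalid regular expression" error.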

/Henrik

Brian Ripley

Wed, Mar 5, 2008 11:09 PM #

On Wed, 5 Mar 2008, Henrik Bengtsson wrote:

No.  Those are settings for various categories, just as you showed for 
Windows.  The first setting appears to be LC_COLLATE, but what they mean is 
not documented on the system man page for setlocale.
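
To see which setting belongs to which category without guessing from the position in the string, the categories can be queried one at a time. A small sketch using base R's Sys.getlocale(); the variable names are only for illustration.

```r
# Query each locale category separately rather than parsing the combined
# string printed by sessionInfo() or Sys.getlocale().
cats <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC", "LC_TIME")
settings <- vapply(cats, Sys.getlocale, character(1))
print(settings)

# LC_COLLATE is the category that governs string comparison and sort().
cat("collation locale:", settings[["LC_COLLATE"]], "\n")
```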

It's just that MacOS uses C collation order in English locales, even 
though almost everyone else uses aAbB or AaBb (the latter being what the 
English actually use, as do almost all book indices in dialects of 
English).  But then there is no surprise that MacOS has to be different 
... its implementation of locales is idiosyncratic (to be generous).

Note that even [A-Za-z] is unsafe -- as I recall Z is in the middle of the 
alphabet in Estonian locales.  If you want alphabetic characters, use 
[[:alpha:]].  If you want ASCII alphabetic characters, write out the 
ranges as [AB...Zab...z]

E.g. (F8 Linux)

> Sys.setlocale("LC_COLLATE", "et_EE.utf8")
[1] "et_EE.utf8"

> paste(sort(c(letters,LETTERS)), collapse="")
[1] "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsZzTtUuVvWwXxYy"
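
One way to write out the ASCII letters without typing all 52 of them is to build the class from R's built-in constants letters and LETTERS. A sketch, assuming base R only; ascii_alpha and the test strings are illustrative names and values.

```r
# Build an explicit bracket expression listing every ASCII letter, so the
# pattern contains no ranges and does not depend on collation order.
ascii_alpha <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

x <- c("foo", "123", "Z")

# Explicit ASCII class.
has_ascii_letter <- grepl(ascii_alpha, x)
print(has_ascii_letter)

# POSIX character class: any alphabetic character; which non-ASCII
# characters count as alphabetic depends on the locale.
has_letter <- grepl("[[:alpha:]]", x)
print(has_letter)

# A non-ASCII letter is not matched by the explicit ASCII class.
print(grepl(ascii_alpha, "\u00e9"))
```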


[...]

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Henrik Bengtsson

Thu, Mar 6, 2008 12:52 AM #

On Wed, Mar 5, 2008 at 11:09 PM, Prof Brian Ripley

<ripley at stats.ox.ac.uk> wrote:

On Wed, 5 Mar 2008, Henrik Bengtsson wrote:

 > On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:

 >> On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:

 >> > Hi,
 >> >
 >> > just curious, but does anyone know the source/reason of observing the
 >> > following error on OSX but not on WinXP and Linux?

 >>
 >>  Presumably in the locale you're using on OSX, "a" < "Z" is false.  This
 >>  is the ascii sort order used in the C locale.  On my Windows box, "a" <
 >>  "Z" is true, because it uses the English_Canada.1252 collation order.

 >
 > That's it indeed.  The person who first reported the error had
 > sessionInfo() locale
 > 'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
 > missed that 'C' in the middle, which I guess his system falls back to
 > if none of the previous ones exist?!?

 No.  Those are settings for various categories, just as you showed for
 Windows.  The first setting appears to be LC_COLLATE, but what they mean is
 not documented on the system man page for setlocale.

 It's just that MacOS uses C collation order in English locales, even
 though almost everyone else uses aAbB or AaBb (the latter being what the
 English actually use, as do almost all book indices in dialects of
 English).  But then there is no surprise that MacOS has to be different
 ... its implementation of locales is idiosyncratic (to be generous).

 Note that even [A-Za-z] is unsafe -- as I recall Z is in the middle of the
 alphabet in Estonian locales.  If you want alphabetic characters, use
 [[:alpha:]].  If you want ASCII alphabetic characters, write out the
 ranges as [AB...Zab...z]

 E.g. (F8 Linux)

 > Sys.setlocale("LC_COLLATE", "et_EE.utf8")

 [1] "et_EE.utf8"

 > paste(sort(c(letters,LETTERS)), collapse="")

 [1] "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsZzTtUuVvWwXxYy"

Alpha and Omega - you said it all.

Thanks for the clarifications.

/Henrik