unexpected behaviour of sub() / usage of regexp

6 messages · Jannis, Duncan Murdoch, Brian Ripley +1 more

#
Dear R users,

As I understand the documentation of sub() and regexp, the following code:

sub('[[:digit:]]{1,2}', '', '9ewww')

... should yield:

'ewww'

It returns, however:

'www'

Why is this the case? The pattern should match one or at most two digits, so only the leading '9' should be removed, not the 'e' that follows it. Am I misinterpreting something here?

Thanks for any ideas
Jannis
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
On 09/12/2011 9:20 AM, Jannis wrote:
I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on
Windows. So it's not a universal problem...

Duncan Murdoch
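
Since the result differs between systems, it can help to collect the R version, platform, and character-type locale before comparing outputs across machines. The following is only a sketch using base R; the list name `info` and its field names are made up for illustration, and which of these details actually matters is addressed in the next reply.

```r
# Collect the details most likely to differ between two machines.
# `info` and its field names are illustrative, not part of any R API.
info <- list(
  r_version = R.version.string,               # e.g. "R version 2.14.0 (2011-10-31)"
  platform  = R.version$platform,             # e.g. "i686-pc-linux-gnu"
  ctype     = Sys.getlocale("LC_CTYPE"),      # character-type locale of this session
  utf8      = isTRUE(l10n_info()[["UTF-8"]])  # TRUE when running in a UTF-8 locale
)
str(info)
```

Posting this next to the sub() result makes it easier to see whether two differing reports come from comparable setups.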
#
This is AFAICS an instance of bug PR#14408 : it seems that in UTF-8 
locales the grammar generated by the TRE engine for repetitions is in 
odd cases buggy.  And as the author has vanished, our hopes of his 
fixing it are slim.

Try perl=TRUE.
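
A minimal sketch of that workaround, applied to the call from the original question. The variable names `x`, `pattern`, and `result` are only illustrative; `perl = TRUE` makes sub() use the PCRE engine instead of the default TRE engine.

```r
# Same pattern and input as in the original question.
x <- "9ewww"
pattern <- "[[:digit:]]{1,2}"

# perl = TRUE switches sub() from the default TRE engine to PCRE.
result <- sub(pattern, "", x, perl = TRUE)
print(result)  # [1] "ewww"
```

The same argument is accepted by gsub(), grepl(), regexpr(), and the other base regular-expression functions, so the workaround is not limited to sub().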
#
But I do get the incorrect result on R 2.14.0 on linux:
[1] "www"

And also:
[1] "www"
[1] "ww9"
[1] "ww9"

But:
[1] "ewww"
[1] "ewww"

So it seems to be something about the way the curly braces are
handled, but only with certain groups:
[1] "9www"
[1] "ewww"


But, as Prof. Ripley's email suggests, perl=TRUE solves the problem.
(I was trying out various combinations when it appeared in my inbox.)
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
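
One way to run this kind of comparison on your own installation is to apply several pattern variants with both engines and flag any disagreement. This is a sketch only: the commands behind the outputs above are not shown in this copy of the thread, so the patterns below are hypothetical stand-ins, not the ones originally run, and the names `patterns` and `compare` are made up for illustration.

```r
x <- "9ewww"

# Hypothetical pattern variants; NOT the commands from the message above.
patterns <- c(
  "[[:digit:]]{1,2}",  # the pattern from the original question
  "[0-9]{1,2}",        # explicit range with the same repetition
  "[[:digit:]]+",      # character class with a different quantifier
  "[[:digit:]]"        # character class without repetition
)

compare <- data.frame(
  pattern = patterns,
  # Default engine (TRE); the result may depend on R version and locale.
  default = vapply(patterns, function(p) sub(p, "", x),
                   character(1), USE.NAMES = FALSE),
  # PCRE engine, requested with perl = TRUE.
  perl    = vapply(patterns, function(p) sub(p, "", x, perl = TRUE),
                   character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# TRUE where both engines return the same string.
compare$agree <- compare$default == compare$perl
print(compare)
```

Any row where `agree` is FALSE points to a pattern that the default engine handles differently from PCRE on that system.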
#
Thanks to all who replied. perl = TRUE indeed seems to fix the problem. It would be great, however, to keep others from stumbling into this pitfall by fixing the issue, if that is possible. But as Prof. Ripley mentioned, fixing it might be difficult or impossible, so we might have to live with it.

By the way, is there an easily accessible and searchable list of such bugs for R (just for the future)?


Thanks a lot
Jannis



----- Original Message -----
From: Sarah Goslee <sarah.goslee at gmail.com>
To: Duncan Murdoch <murdoch.duncan at gmail.com>
Cc: Jannis <bt_jannis at yahoo.de>; "r-help at r-project.org" <r-help at r-project.org>
Sent: 15:37 Friday, 9 December 2011
Subject: Re: [R] unexpected behaviour of sub() / usage of regexp
#
On 09/12/2011 14:49, Jannis wrote:
http://www.bugs.r-project.org

I'm not sure how obvious it would be that it is the same problem.  I 
happened to have worked on trying to solve it.