Change in grep behavior from 1.9.0 to R-patched

Fri, Jun 11, 2004 11:21 AM

I think I have a solution, which I am just about to commit.  It looks as if 
the PCRE documentation I read is wrong about when it is safe to free the 
locale-specific tables, so I've deferred freeing them until much later.

Incidentally, I cannot make this misbehave on Windows.

On Fri, 11 Jun 2004, Prof Brian Ripley wrote:

So the consensus is

- it happens equally in 1.9.0 and 1.9.1 alpha current
- it happens in the C locale
- it is random and bursty, as in

   [1] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
  [25] 84 84 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
  [49] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
  [73] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
  [97] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 [121] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 [145] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 84 84 84 84 84 84
 [169] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
 [193] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 13 13 84 84 84 13 13 13
 [217] 84 84 84 13 13 13 84 84 84 13 13 13 84 84 84 13 13 13 13 13 13 13 13 13
...
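
One way to make the burstiness visible is to collapse the results into runs. 
The sketch below assumes nothing beyond base R; `d_runs` is a hypothetical 
name, filled with a transcription of the first 216 values printed above 
rather than a fresh replicate() run.

```r
# First 216 values from the output above, transcribed as run lengths
# (d_runs is a hypothetical name, not an object from the thread).
d_runs <- rep(c(84, 13, 84, 13, 84, 13), times = c(26, 136, 46, 2, 3, 3))

# rle() collapses consecutive equal values into (value, length) pairs.
runs <- rle(d_runs)

# One row per run: long stretches of the same answer show up as large lengths.
run_table <- data.frame(value = runs$values, length = runs$lengths)
print(run_table)
```

Long runs followed by short alternating ones would be consistent with state 
that persists across calls rather than with independent failures.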

So it looks like a problem in the PCRE compiled code.

On Fri, 11 Jun 2004, Marc Schwartz wrote:

On Fri, 2004-06-11 at 10:28, Prof Brian Ripley wrote:

This is actually PCRE.  Something is wrong with your build of R-patched
(1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
character l, then one or more characters of `word' then tmean.  In your
example this is the same as (in a suitable locale, including C)

length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))

I omitted _ there, not that it mattered.

length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))

which each give 84.
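
A small sketch of that equivalence, for anyone who wants to check it without 
downloading names.R.  The vector `x_demo` is made up for illustration and is 
not the real data; the explicit class here includes _ so that it mirrors \w 
exactly.

```r
# Hypothetical stand-in for the real names vector; not the names.R data.
x_demo <- c("l1tmean", "l_2tmean", "labctmean", "ltmean", "xl1tmean", "l1pm10")

# Three spellings of "l, one or more word characters, then tmean".
pats <- c(word  = "^l\\w+tmean",
          class = "^l[A-Za-z0-9_]+tmean",
          posix = "^l[[:alnum:]_]+tmean")

# With ASCII-only input like this, all three patterns should select the
# same elements.
hits <- lapply(pats, function(p) grep(p, x_demo, perl = TRUE, value = TRUE))
print(hits)
```

If the three results ever differ on ASCII-only input, that points at the 
regex engine or its locale tables rather than at the pattern.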

One issue: PCRE is locale-dependent.  Did you use the same locale for 
each?  What happens if you force LANG=C?
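
As an in-session alternative to setting LANG=C before starting R, the 
character-type locale can be switched with Sys.setlocale().  This is only a 
sketch and may not be equivalent to a clean start under LANG=C; the input 
vector and the names `old_ctype` and `n_c` are hypothetical.

```r
# Remember the current character-type locale so it can be restored.
old_ctype <- Sys.getlocale("LC_CTYPE")

# Switch character classification to the C locale for the test.
Sys.setlocale("LC_CTYPE", "C")

# Hypothetical two-element input: only the first element should match.
n_c <- length(grep("^l\\w+tmean", c("l1tmean", "ltmean"), perl = TRUE))
print(n_c)

# Restore whatever was in effect before.
Sys.setlocale("LC_CTYPE", old_ctype)
```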

(I've just checked an R-devel Solaris system.  This gave 13 on a build 
from Weds, and 84 when remade today.  The result with 13 seems truncated, 
as they are the first 13.  Might be coincidental, of course.)


The above is confirmed using Version 1.9.1 alpha (2004-06-10) on FC2:

x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))

[1] 84

length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))

[1] 84


Also, to demonstrate Roger's follow up example:

d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))

summary(d)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   13.00   13.00   14.14   13.00   84.00

table(d) is more informative.
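
To illustrate why, here is a sketch with a made-up vector `d_two`.  It 
assumes only the values 13 and 84 occur; under that assumption, 984 and 16 
of them respectively would be consistent with the mean of 14.14 in the 
summary above, but these are not the actual counts from that run.

```r
# Hypothetical two-valued result vector (assumed counts, see above).
d_two <- rep(c(13, 84), times = c(984, 16))

# The five-number summary hides how the values split:
# the quartiles all sit on the majority value.
print(summary(d_two))

# table() reports how often each distinct value occurred.
tab <- table(d_two)
print(tab)
```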

BTW: pcre-4.5-2

Did you use --with-pcre, though?

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
