Dear R developers,
I suggest modifying the behaviour of the "grep" function when the fixed=TRUE option is set.
Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
if set by the user).
I suggest keeping ignore.case as set by the user even if fixed=TRUE. Since
the default of ignore.case is FALSE, this would not change the behaviour
of grep, if the user does not set ignore.case explicitly.
In my opinion, fixed=TRUE is most useful for suppressing meta-character
expansion. On the other hand, for a simple word search, ignoring
case is sometimes useful.
If, for some reason, it is better to keep the current behavior of grep, then I
suggest extending the documentation as follows:
ORIGINAL:
fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
is. Overrides all conflicting arguments.
SUGGESTED:
fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
is. Overrides all conflicting arguments including ignore.case.
All the best, Petr Savicky.
grep with fixed=TRUE and ignore.case=TRUE
7 messages · Gabor Grothendieck, Petr Savicky, Brian Ripley
Seems like a good idea to me.
Here is a workaround that works in any event. It combines (?i), \Q and \E
to get the same effect: (?i) gives case-insensitive matches, and \Q and \E
quote and endquote the intervening text, disabling special characters:
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"
rx <- paste("(?i)\\Q", z, "\\E", sep = "")
grep(rx, x, perl = TRUE) # 1 2
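The workaround above could be wrapped in a small helper. The following is only a sketch: the function name grep_literal_ic is hypothetical and not part of base R, and the check for "\E" is an added assumption, since \Q...\E quoting ends at the first \E in the pattern.

```r
# Hypothetical helper (not part of base R): literal, case-insensitive grep
# built on the (?i)\Q...\E workaround from the message above.
grep_literal_ic <- function(pattern, x, ...) {
  # \Q...\E quoting ends at the first \E, so a pattern that itself contains
  # "\E" cannot be quoted this way; refuse it rather than match wrongly.
  if (grepl("\\E", pattern, fixed = TRUE)) {
    stop("pattern contains \\E, which would end the \\Q quoting early")
  }
  rx <- paste0("(?i)\\Q", pattern, "\\E")
  grep(rx, x, perl = TRUE, ...)
}

x <- c("D.G cat", "d.g cat", "dog cat")
idx  <- grep_literal_ic("d.g", x)                # indices of matching elements
vals <- grep_literal_ic("d.g", x, value = TRUE)  # the matching elements themselves
```

Because extra arguments are passed on to grep, value = TRUE works without further code.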
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
1 day later
On Mon, 7 May 2007, Petr Savicky wrote:
Dear R developers, I suggest modifying the behaviour of the "grep" function when the fixed=TRUE option is set. Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE, if set by the user).
As it clearly says it does.
I suggest keeping ignore.case as set by the user even if fixed=TRUE. Since the default of ignore.case is FALSE, this would not change the behaviour of grep, if the user does not set ignore.case explicitly. In my opinion, fixed=TRUE is most useful for suppressing meta-character expansion. On the other hand, for a simple word search, ignoring case is sometimes useful.
Well, it was written to use in R's own code as a quick way to match a fixed sequence of bytes. It is not suitable for a 'word' search as it does not (just) match to words.
If, for some reason, it is better to keep the current behavior of grep, then I
suggest extending the documentation as follows:
ORIGINAL:
fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
is. Overrides all conflicting arguments.
SUGGESTED:
fixed: logical. If 'TRUE', 'pattern' is a string to be matched as
is. Overrides all conflicting arguments including ignore.case.
Oh come on, ignore.case clearly conflicts with 'as is'! Adding unnecessary qualifiers just makes the text harder to read. I suggest you collaborate with the person who replied that he thought this was a good idea to supply patches against the R-devel sources for scrutiny.
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
2 days later
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
I suggest you collaborate with the person who replied that he thought this was a good idea to supply patches against the R-devel sources for scrutiny.
A possible solution is to use strncasecmp instead of strncmp in function fgrep_one in R-devel/src/main/character.c. The corresponding modification of character.c is at http://www.cs.cas.cz/~savicky/ignore_case/character.c and the diff file w.r.t. the original character.c (downloaded today) is at http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
This seems to work in my installation of R-devel:
> x <- c("D.G cat", "d.g cat", "dog cat")
> z <- "d.g"
> grep(z, x, ignore.case = F, fixed = T)
[1] 2
> grep(z, x, ignore.case = T, fixed = T) # this is the new behavior
[1] 1 2
> grep(z, x, ignore.case = T, fixed = F)
[1] 1 2 3
>
Since fgrep_one is used many times in character.c, adding igcase_opt as an additional argument would imply extensive changes to the file. So, I introduced a new function fgrep_one_igcase, called only once in the file. Another solution is possible. I do not understand the handling of multibyte chars well, so I did not test the function with real multibyte chars, although the code for this option is used.
The ignore case option is not meaningful in gsub. It could be meaningful in regexpr; however, this function does not allow the ignore.case option, so I made no changes to it.
All the best, Petr.
2 days later
On Fri, 11 May 2007, Petr Savicky wrote:
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
I suggest you collaborate with the person who replied that he thought this was a good idea to supply patches against the R-devel sources for scrutiny.
A possible solution is to use strncasecmp instead of strncmp in function fgrep_one in R-devel/src/main/character.c. Corresponding modification of character.c is at http://www.cs.cas.cz/~savicky/ignore_case/character.c and diff file w.r.t. the original character.c (downloaded today) is at http://www.cs.cas.cz/~savicky/ignore_case/diff.txt This seems to work in my installation of R-devel:
> x <- c("D.G cat", "d.g cat", "dog cat")
> z <- "d.g"
> grep(z, x, ignore.case = F, fixed = T)
[1] 2
> grep(z, x, ignore.case = T, fixed = T) # this is the new behavior
[1] 1 2
> grep(z, x, ignore.case = T, fixed = F)
[1] 1 2 3
>
Since fgrep_one is used many times in character.c, adding igcase_opt as an additional argument would imply extensive changes to the file. So, I introduced a new function fgrep_one_igcase called only once in the file. Another solution is possible. I do not understand well handling multibyte chars, so I did not test the function with real multibyte chars, although the code for this option is used.
Thanks for looking into this. strncasecmp is not standard C (not even C99), but R does have a substitute for it. Unfortunately strncasecmp is not usable with multibyte charsets: Linux systems have wcsncasecmp but that is not portable. In these days of widespread use of UTF-8 that is a blocking issue, I am afraid.
In the case of grep I think all you need is
grep(tolower(pattern), tolower(x), fixed = TRUE)
and similarly for regexpr.
The ignore case option is not meaningful in gsub.
sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean? (Unfortunately the tolower trick does not work for
[g]sub.)
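The tolower suggestion, and the remark that it does not carry over to [g]sub, can be illustrated with a short sketch. The vectors are illustrative only, and the claim that match positions stay valid assumes that tolower does not change string lengths, which holds for these ASCII strings.

```r
x <- c("D.G cat", "d.g cat", "dog cat")
pattern <- "D.g"

# Lower-case both sides, then match the characters literally.
idx <- grep(tolower(pattern), tolower(x), fixed = TRUE)

# regexpr works the same way: positions found in the lower-cased copy are
# also valid in the original, because tolower() keeps these ASCII strings
# the same length.
pos <- regexpr(tolower(pattern), tolower(x), fixed = TRUE)

# For sub the trick breaks down: the replacement happens in the lower-cased
# copy, so the untouched part of each string loses its case as well.
lowered  <- sub(tolower("abc"), "123", tolower(c("ABCD", "abcd")), fixed = TRUE)
expected <- sub("abc", "123", c("ABCD", "abcd"), ignore.case = TRUE)
```

Here lowered has lost the capital "D" of the first element, while expected keeps it.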
3 days later
strncasecmp is not standard C (not even C99), but R does have a substitute for it. Unfortunately strncasecmp is not usable with multibyte charsets: Linux systems have wcsncasecmp but that is not portable. In these days of widespread use of UTF-8 that is a blocking issue, I am afraid.
What could help are the functions mbrtowc and towctrans and simple long integer comparison. Are the functions mbrtowc and towctrans available under Windows? mbrtowc seems to be available as Rmbrtowc in src/gnuwin32/extra.c. I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower is not usable with utf-8 under Windows?
In the case of grep I think all you need is grep(tolower(pattern), tolower(x), fixed = TRUE) and similarly for regexpr.
Yes, this is correct, but it has disadvantages. It needs more space and, if value=TRUE, we would have to do something like
x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
This is hard to implement in src/library/base/R/grep.R, where the call to .Internal(grep(pattern,...)) is the last command, and I think this should be preserved.
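At user level, the indexing step described above might look like the following sketch. The wrapper name grep_fixed_ic is hypothetical; this is not the code in grep.R, and it keeps a lower-cased copy of x in memory, which is the space cost mentioned above.

```r
# Hypothetical user-level wrapper, not the actual code in grep.R.
grep_fixed_ic <- function(pattern, x, value = FALSE) {
  # Match on lower-cased copies, but return indices into (or elements of)
  # the original vector so that the original case is preserved.
  idx <- grep(tolower(pattern), tolower(x), fixed = TRUE)
  if (value) x[idx] else idx
}

x <- c("D.G cat", "d.g cat", "dog cat")
hits_idx <- grep_fixed_ic("D.G", x)
hits_val <- grep_fixed_ic("D.G", x, value = TRUE)
```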
The ignore case option is not meaningful in gsub.
sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean? (Unfortunately the tolower trick does not work for
[g]sub.)
The meaning of ignore.case in [g]sub is problematic due to the following.
sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces
[1] "xyzD" "xyzd"
but the user may in fact need the following
[1] "XYZD" "xyzd"
It is correct that "xyzD" "xyzd" is produced, but the user
should be aware of the fact that several substitutions like
x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
sub("ABC", "XYZ", x) # ignore.case=FALSE
may be more useful.
I have another question concerning the speed of grep. I expected that
fgrep_one function is slower than calling a library routine
for regular expressions. In particular, if the pattern has a lot of
long partial matches in the target string, I expected that it may be much
slower. A short example is
y <- "aaaaaaaaab"
x <- "aaaaaaaaaaaaaaaaaaab"
grep(y,x)
which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m,n are the lengths of y,x resp. I would expect that
the library function for matching regular expressions needs
time O(m+n) and so will be faster. However, the result obtained
on a larger example is
> x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> y <- paste(c(rep("a", times = 10000), x2), collapse = "")
> z <- rep(y, times = 100)
> system.time(i <- grep(x1, z, fixed = T))
[1] 1.970 0.000 1.985 0.000 0.000
> system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
[1] 40.374 0.003 40.381 0.000 0.000
> system.time(i <- grep(x2, z, fixed = T))
[1] 0.113 0.000 0.113 0.000 0.000
> system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
[1] 0.019 0.000 0.019 0.000 0.000
Do you have an explanation of these results, in particular (*)?
Petr.
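The count of 110 comparisons in the short example can be checked with a small sketch. It models a naive search that tries every start position and stops at the first mismatch; it is a simplified model, not R's actual fgrep_one code, and it assumes the two strings have 10 and 20 characters, as the stated count implies.

```r
# Count character comparisons made by a naive substring search that tries
# every start position and stops comparing at the first mismatch.
# Simplified model only; this is not the code of fgrep_one.
count_naive_comparisons <- function(pattern, text) {
  p <- strsplit(pattern, "", fixed = TRUE)[[1]]
  s <- strsplit(text, "", fixed = TRUE)[[1]]
  m <- length(p)
  n <- length(s)
  comparisons <- 0L
  if (m == 0L || m > n) return(comparisons)
  for (start in seq_len(n - m + 1L)) {
    for (j in seq_len(m)) {
      comparisons <- comparisons + 1L
      if (s[start + j - 1L] != p[j]) break
    }
  }
  comparisons
}

y <- paste0(strrep("a", 9), "b")    # pattern: 9 "a" followed by "b"
x <- paste0(strrep("a", 19), "b")   # target: 19 "a" followed by "b"
n_cmp <- count_naive_comparisons(y, x)
```

Each of the 11 start positions costs 10 comparisons, so the work grows as m*n for inputs of this shape.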
On Thu, 17 May 2007, Petr Savicky wrote:
strncasecmp is not standard C (not even C99), but R does have a substitute for it. Unfortunately strncasecmp is not usable with multibyte charsets: Linux systems have wcsncasecmp but that is not portable. In these days of widespread use of UTF-8 that is a blocking issue, I am afraid.
What could help are the functions mbrtowc and towctrans and simple long integer comparison. Are the functions mbrtowc and towctrans available under Windows? mbrtowc seems to be available as Rmbrtowc in src/gnuwin32/extra.c. I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide
I don't see it in Rdll.hide. It is a C99 function (see your unix man page).
and used in do_tolower. Does this mean that tolower is not usable with utf-8 under Windows?
UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in so far as that makes sense: Chinese chars do not have 'case'). Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is not currently active.
In the case of grep I think all you need is grep(tolower(pattern), tolower(x), fixed = TRUE) and similarly for regexpr.
Yes, this is correct, but it has disadvantages. It needs more space and, if value=TRUE, we would have to do something like
x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
This is hard to implement in src/library/base/R/grep.R, where the call to .Internal(grep(pattern,...)) is the last command, and I think this should be preserved.
The ignore case option is not meaningful in gsub.
sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
is different from 'ignore.case=FALSE', and I see the meaning as clear.
So what did you mean? (Unfortunately the tolower trick does not work for
[g]sub.)
The meaning of ignore.case in [g]sub is problematic due to the following.
sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces
[1] "xyzD" "xyzd"
but the user may in fact need the following
[1] "XYZD" "xyzd"
He may, but that is not what 'ignore case' means, more like 'case honouring'.
It is correct that "xyzD" "xyzd" is produced, but the user
should be aware of the fact that several substitutions like
x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
sub("ABC", "XYZ", x) # ignore.case=FALSE
may be more useful.
I have another question concerning the speed of grep. I expected that
fgrep_one function is slower than calling a library routine
for regular expressions. In particular, if the pattern has a lot of
long partial matches in the target string, I expected that it may be much
slower. A short example is
y <- "aaaaaaaaab"
x <- "aaaaaaaaaaaaaaaaaaab"
grep(y,x)
which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m,n are the lengths of y,x resp. I would expect that
the library function for matching regular expressions needs
time O(m+n) and so will be faster. However, the result obtained
on a larger example is
> x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> y <- paste(c(rep("a", times = 10000), x2), collapse = "")
> z <- rep(y, times = 100)
> system.time(i <- grep(x1, z, fixed = T))
[1] 1.970 0.000 1.985 0.000 0.000
> system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
[1] 40.374 0.003 40.381 0.000 0.000
> system.time(i <- grep(x2, z, fixed = T))
[1] 0.113 0.000 0.113 0.000 0.000
> system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
[1] 0.019 0.000 0.019 0.000 0.000
Do you have an explanation of these results, in particular (*)?
Yes, there is a comment on the help page to that effect. But these are highly atypical uses. Try perl=TRUE, and be aware that the locale matters a lot in such tests (via the charset). No one is attempting to make R a fast string-processing language and so developers' resources are spent on performance where it matters to more typical usage. (E.g. reducing duplication in as.double and friends speeds up just about every R session, and speeds up some numerical sessions dramatically.)
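A sketch of the suggested comparison with perl=TRUE follows. The strings are scaled-down versions of those in the question (an assumption made so that the sketch runs quickly), and no timings are claimed here: they depend on the R version, the regex libraries and the locale.

```r
# Smaller versions of the strings from the question, so the sketch runs quickly.
x1 <- paste0(strrep("a", 200), "b")
y  <- paste0(strrep("a", 2000), "b", strrep("a", 200))
z  <- rep(y, times = 10)

# Time the same search with the three matching engines.
t_fixed <- system.time(i_fixed <- grep(x1, z, fixed = TRUE))
t_regex <- system.time(i_regex <- grep(x1, z))
t_perl  <- system.time(i_perl  <- grep(x1, z, perl = TRUE))

# Elapsed times vary between machines and locales; compare them locally.
print(c(fixed = t_fixed[["elapsed"]],
        regex = t_regex[["elapsed"]],
        perl  = t_perl[["elapsed"]]))
```

All three calls should return the same indices; only the time taken differs.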