
regex -> negate a word

24 messages · Rau, Roland, jim holtman, Eric Archer +5 more

#
Dear all,

let's assume I have a vector of character strings:

x <- c("abcdef", "defabc", "qwerty")

What I would like to find is the following: all elements in which the
string 'abc' does not appear (i.e. element 3 of 'x' in this case).

Since I am not really experienced with regular expressions, I started
slowly and thought I would first find all elements where 'abc' actually does appear:
[1] 1 2

So far, so good. Now I read that ^ is the negation operator. But it can
also denote the beginning of a string as in:
[1] 1

Of course, we need to put it inside square brackets to negate the
expression [1]
[1] 1 2 3

But this is not what I want either.

I'd appreciate any help. I assume this is rather easy and
straightforward.

Thanks,
Roland


[1] http://www.zytrax.com/tech/web/regex.htm: The ^ (circumflex or
caret) inside square brackets negates the expression....
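
The grep calls behind the three outputs in the message above seem to have been dropped from the archive. A minimal sketch of what they may have looked like, assuming plain grep() calls with the patterns the text describes (the exact calls are an assumption, not the original code):

```r
x <- c("abcdef", "defabc", "qwerty")

# elements that contain 'abc' anywhere
hits <- grep("abc", x)

# '^' outside brackets anchors the match at the start of the string
anchored <- grep("^abc", x)

# '[^abc]' is a negated character class: any single character
# other than 'a', 'b' or 'c'
negclass <- grep("[^abc]", x)

print(hits)
print(anchored)
print(negclass)
```

The bracketed form negates a set of single characters, not the word 'abc', so every string that contains at least one character outside that set matches. That would explain why all three indices come back.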

#
Just remove those elements that match:
[1] "qwerty"

        
On Sun, Jan 18, 2009 at 1:35 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:

  
    
#
Rau, Roland wrote:
a quick shot is:

x[-grep("abc", x)]

which unfortunately fails if none of the strings in x matches the
pattern, i.e., grep returns integer(0); arguably, x[integer(0)] should
rather return all elements of x:

"An empty index selects all values" (from ?'[')

but apparently integer(0) does not count as an empty index (and neither
does NULL).  so you may want something like:

strings = c("abcdef", "defabc", "qwerty")
pattern = "abc"
if (length(matching <- grep(pattern, strings))) strings[-matching] else strings

vQ
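
The no-match pitfall described above can be checked directly. This is only a sketch; the helper name drop_matches and the pattern "zzz" are made up for illustration:

```r
strings <- c("abcdef", "defabc", "qwerty")

# hypothetical helper, introduced only for this illustration:
# drop the elements that match, but keep everything if nothing matches
drop_matches <- function(pattern, strings) {
    matching <- grep(pattern, strings)
    if (length(matching)) strings[-matching] else strings
}

# grep() finds nothing, and -integer(0) is still integer(0),
# so the naive form selects no elements at all
naive_none <- strings[-grep("zzz", strings)]

safe_none <- drop_matches("zzz", strings)   # nothing matches: all kept
safe_abc  <- drop_matches("abc", strings)   # matching elements dropped

print(naive_none)
print(safe_none)
print(safe_abc)
```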
#
Try this:

# indexes
setdiff(seq_along(x), grep("abc", x))

# values
setdiff(x, grep("abc", x, value = TRUE))

Another possibility is:

z <- "abc"
x0 <- c(x, z) # to handle no match case
x0[- grep(z, x0)] # values
On Sun, Jan 18, 2009 at 1:35 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:
#
Roland,

I think you were almost there with your first example.  How about using:

 > x <- c("abcdef", "defabc", "qwerty")
 > y <- grep(pattern="abc", x=x)
 > z.char <- x[-y]
 > z.index <- (1:length(x))[-y]
 >
 > z.char
[1] "qwerty"
 > z.index
[1] 3

Cheers,
eric
Rau, Roland wrote:

  
    
#
Try this:

grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)
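
With perl = TRUE, another way to write "does not contain 'abc'" as a single pattern is a negative lookahead. The sketch below is not from the thread; the extra test string "aabc" is added here as an assumption, to probe an edge case where the hand-built pattern above appears to let a string containing 'abc' through (the 'aa' is consumed by the a[^b] branch):

```r
x <- c("abcdef", "defabc", "qwerty", "aabc")

# (?!.*abc) is a negative lookahead: the match at the start of the
# string succeeds only if 'abc' does not occur anywhere after it
no_abc <- grep("^(?!.*abc)", x, perl = TRUE)

# the hand-built pattern from the message above, for comparison
hand_built <- grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

print(no_abc)
print(hand_built)
```

The lookahead form keeps the positive pattern readable, since 'abc' appears in it unchanged.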
On Sun, Jan 18, 2009 at 2:37 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:
#
Gabor Grothendieck wrote:
on quick testing, these two and the if-based version have comparable
runtime, with a minor win for the last one, and if the input is moderate
this makes no real difference.

however, the second solution above is likely to fail if the pattern is
more complex, e.g., contains a character class or a wildcard:

strings = c("xyz")
pattern = "a[a-z]"
strings[-grep(pattern, c(strings, pattern))]
# character(0)


vQ
#
In that case just add fixed = TRUE

On Sun, Jan 18, 2009 at 2:58 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
#
Gabor Grothendieck wrote:
in general, if you want a complex pattern, you don't use 'fixed', and
then again you risk incorrect (well, correct for r, but not for the
problem) result in case no input string matches the pattern.


vQ
#
Gabor Grothendieck wrote:
... and see how cumbersome it becomes for a pattern as trivial as 'abc'. 

in perl, you typically don't invent such negative patterns, but rather
"don't match" positive patterns: instead of the match operator =~ and a
negative pattern, you use the no-match operator !~ and a positive pattern:

@strings = ("abc", "xyz");
@filtered = grep $_ !~ /abc/, @strings;

in r, one way to do the no-match is using -grep, but taking care of the
special case of no matches at all in the input vector.
in perl 5.10, you can try this:

@strings = ("abc", "xyz");
@filtered = grep $_ =~ /(abc)(*COMMIT)(*FAIL)|(*ACCEPT)/, @strings;

which works by making a string that matches the pattern fail, and any
other string succeed despite no match.

vQ
#
Wacek Kusnierczyk wrote:
incidentally, recent pcre accepts such regexes:

# r code
ungrep = function(pattern, x, ...)
    grep(paste(pattern, "(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x,
         perl=TRUE, ...)

strings = c("abc", "xyz")
pattern = "a[a-z]"
(filtered = strings[ungrep(pattern, strings)])
# "xyz"

vQ
#
Wacek Kusnierczyk wrote:
this was a toy example, but if you need this sort of ungrep with
patterns involving alternations, you need a fix:

ungrep("a|x", strings, value=TRUE)
# "abc"
# NOT character(0)

# fix
ungrep = function(pattern, x, ...)
    grep(paste("(?:", pattern, ")(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""),
         x, perl=TRUE, ...)

ungrep("a|x", strings, value=TRUE)
# character(0)


vQ
#
Well, that's why it was only provided when you insisted.  This is
not what regexp's are good at.
On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:
#
On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:

            
It may not be what regexp's are good at, but the grep command in unix/linux
does what is required *very* simply via the ``-v'' flag.  I conjecture that
it would not be difficult to add an argument with similar impact to the
grep() function in R.

	cheers,

		Rolf Turner

#
That's an entirely different point from whether regular expressions
can do it as grep -v is just another way to do it without using a regular
expression to specify the entire job.
On Sun, Jan 18, 2009 at 5:02 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
#
On Sun, Jan 18, 2009 at 2:22 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
Yes.
The meaning of x[V] (for an integer subscript vector V) is: ignore 0
entries, and then:

a) if !(all(V>0) | all(V<0) ) => ERROR
b) if all (V>0): length(x[V]) == length(V)
c) if all (V<0): length(x[V]) == length(x)-length(unique(V))

When length(V)==0, the preconditions are true for both (b) and (c), so
the R design has made the decision that length(x[V]) == 0 in this
case.  If you're going to have the "negative indices means exclusion"
trick, this seems like a reasonable convention.

Of course, that means that you can't in general use x[-V] (where
all(V>0)) to mean "all elements that are not in V".  However, there is
a workaround if you have an upper bound on length(x):

       x[ c(-2^30, -V) ]

This guarantees at least one negative number.

           -s
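
The subscript rules and the workaround above can be tried out as follows. This is a sketch under the assumptions stated in the message (V holds positive integer indices and length(x) is well below 2^30); the variable names are illustrative only:

```r
x <- c("abcdef", "defabc", "qwerty")
V <- integer(0)   # e.g. what grep() returns when nothing matches

empty_pos <- x[V]    # zero-length index: selects nothing
empty_neg <- x[-V]   # -integer(0) is still integer(0): also nothing

# the extra negative index is out of range and therefore ignored,
# but it guarantees that the subscript is never empty
guarded_none <- x[c(-2^30, -V)]
guarded_some <- x[c(-2^30, -c(1, 2))]

# case (a): mixing positive and negative subscripts is an error
mixed <- tryCatch(x[c(1, -2)], error = function(e) "error")

print(guarded_none)
print(guarded_some)
print(mixed)
```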
#
Note that the variation of this that I posted already handles that case.
On Sun, Jan 18, 2009 at 5:32 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
#
Rolf Turner wrote:
something like grep(..., inverse=TRUE), perhaps.

vQ
#
Stavros Macrakis wrote:
what about numeric vectors?  r performs smart downcasting here:

x[1.1]
# same as x[1]

x[0.3]
# character(0)
what if V=NULL?
there is no error for x[V] with V=0, V=as.numeric(NA), or V=NaN.
unfortunately, false if V contains a non-integer (so it goes beyond your
discussion, but may cause problems in practice):

x[c(1, 0.5)]
# one item (if x is non-empty)
not true for cases like V=c(-1, -1.5), which again go beyond your
discussion, but may happen in practice.

interestingly, unique(c(NA, NA)) is just NA, rather than c(NA,NA).  i'd
think that if we have two non-available values, we can't be sure they're
in fact equal, but unique apparently is.  (you'd have to tell it not to
be with incomparables=NA.)
interestingly, all(V>0) & all(V<0) is TRUE for V=c().
i didn't say this was unreasonable, just that x[integer(0)] should,
arguably, return x.  'empty index' is not as precise an expression to be
sure that it will be obvious to everyone that integer(0) is *not* an
empty index, and less so with NULL.  what is meant, i guess, is 'empty
index expression', i.e., no index rather than empty index, and i'd
humbly suggest (risking being charged with boring pedantry) to improve tfm.


vQ
#
On Mon, 19 Jan 2009, Rolf Turner wrote:

            
Indeed.  I have often wondered why grep() returned indices, when a 
logical vector would seem more natural in R (and !grep(...) would have 
been all that was needed).

Looking at the code I see it does in fact compute a logical vector, 
just not return it.  So adding 'invert' (the long form of -v is 
--invert-match) is a job of a very few lines and I have done so for 2.9.0.
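
A short sketch of how the original question reads with this argument, assuming a version of R in which grep() has 'invert' (grepl(), used in the last line, returns the logical vector mentioned above and is likewise assumed to be available):

```r
x <- c("abcdef", "defabc", "qwerty")

idx  <- grep("abc", x, invert = TRUE)                # indices of non-matching elements
vals <- grep("abc", x, invert = TRUE, value = TRUE)  # the elements themselves

# the no-match case needs no special handling: everything is kept
all_kept <- grep("zzz", x, invert = TRUE, value = TRUE)

# the same selection through a logical mask
masked <- x[!grepl("abc", x)]

print(idx)
print(vals)
print(all_kept)
print(masked)
```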
1 day later
#
Prof Brian Ripley wrote:
in fact, it's simpler than that.  instead of redundantly distributing
the fix over four different lines in character.c, it's enough to ^= the
logical vector of matched/unmatched flags in just one place, on-the-fly,
close to the end of the loop over the vector of input strings.  see
attached patch.

for consistency, you might want to
- name the internal invert flag 'invert_opt' instead of 'invert';
- apply the same fix to agrep.

it's also trivial to add another argument to grep, say 'logical', which
will cause grep to return a logical vector of the same length as the
input strings vector.  see the attached patch.  note: i am novice to r
internals, and i get some mystical warnings i haven't decoded yet while
using the extended grep, but otherwise the code compiles well and grep
works as intended; you'd need to fix the cause of the warnings.

if you want the 'logical' argument, you need to decide how it interacts
with 'value'.  in the patch, 'value' set to TRUE resets 'logical' to
FALSE, with a warning.

further suggestions:  the arguments 'value' and 'logical' could be
replaced with one argument, say 'output', which would take a value from
{'indices', 'values', 'logical'}.  it might make further extensions
easier to implement and maintain.

attached are patches to character.c, names.c, and grep.R; if you tell me
which other files need a patch to get rid of the warnings (see below),
i'll make one. 

s = c("abc", "bcd", "cde")

grep("b", s)
# 1 2

grep("b", s, value=TRUE)
# "abc" "bcd"

grep("b", s, logical=TRUE)
# TRUE TRUE FALSE

s[grep("b", s, logical=TRUE)]
# "abc" "bcd"
# Warning: stack imbalance in 'grep', 9 then 10
# Warning: stack imbalance in '.Internal', 8 then 9
# Warning: stack imbalance in '{', 6 then 7

grep("b", s, invert=TRUE)
# 3

grep("b", s, invert=TRUE, value=TRUE)
# "cde"

s[!grep("b", s, logical=TRUE)]
# "cde"
# Warning: stack imbalance in 'grep', 15 then 16
# Warning: stack imbalance in '.Internal', 14 then 15
# Warning: stack imbalance in '{', 12 then 13
# Warning: stack imbalance in '!', 6 then 7
# Warning: stack imbalance in '[', 2 then 3



vQ
#
Wacek Kusnierczyk wrote:
forgot to add:  the patches are against the latest r-devel
(19.01.2009).  compiled and tested on 32b Ubuntu 8.04.


vQ