regex -> negate a word

24 messages · Rau, Roland, jim holtman, Eric Archer +5 more


Rau, Roland

Sun, Jan 18, 2009 10:35 AM #

Dear all,

let's assume I have a vector of character strings:

x <- c("abcdef", "defabc", "qwerty")

What I would like to find is the following: all elements where the word
'abc' does not appear (i.e. 3 in this case of 'x').

Since I am not really experienced with regular expressions, I started
slowly and thought I would first find all elements where 'abc' actually
does appear:

grep(pattern="abc", x=x)

[1] 1 2

So far, so good. Now I read that ^ is the negation operator. But it can
also denote the beginning of a string, as in:

grep(pattern="^abc", x=x)

[1] 1

Of course, we need to put it inside square brackets to negate the
expression [1]:

grep(pattern="[^abc]", x=x)

[1] 1 2 3

But this is not what I want either.
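
The reason the pattern "[^abc]" returns every index is that, inside square brackets, ^ negates a character class rather than a word: the class matches any single character other than a, b, or c, so every string containing at least one such character is a hit. A small sketch of the difference (the string "cab" is added here purely for illustration and is not part of the original example):

```r
x <- c("abcdef", "defabc", "qwerty", "cab")  # "cab" is an added illustration

# TRUE wherever the string holds at least one character outside {a, b, c}
has_other_char <- grepl("[^abc]", x)

# TRUE wherever the literal substring "abc" occurs
has_abc <- grepl("abc", x, fixed = TRUE)

print(has_other_char)  # only "cab" is FALSE: it consists of a, b and c alone
print(has_abc)
```

So the bracketed form answers "does the string contain something other than a, b, c?", which is a different question from "is the word abc absent?".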

I'd appreciate any help. I assume this is rather easy and
straightforward.

Thanks,
Roland


[1] http://www.zytrax.com/tech/web/regex.htm: The ^ (circumflex or
caret) inside square brackets negates the expression....

----------
This mail has been sent through the MPI for Demographic Research.  Should you receive a mail that is apparently from a MPI user without this text displayed, then the address has most likely been faked. If you are uncertain about the validity of this message, please check the mail header or ask your system administrator for assistance.

jim holtman

Sun, Jan 18, 2009 11:17 AM #

Just remove those elements that match:

[1] "qwerty"

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Wacek Kusnierczyk

Sun, Jan 18, 2009 11:22 AM #

Rau, Roland wrote:

a quick shot is:

x[-grep("abc", x)]

which unfortunately fails if none of the strings in x matches the
pattern, i.e., grep returns integer(0); arguably, x[integer(0)] should
rather return all elements of x:

"An empty index selects all values" (from ?'[')

but apparently integer(0) does not count as an empty index (and neither
does NULL).  so you may want something like:

strings = c("abcdef", "defabc", "qwerty")
pattern = "abc"
# drop the matching elements only if there is at least one match
if (length(matching <- grep(pattern, strings))) strings[-matching] else strings
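
The pitfall can be reproduced directly. A minimal sketch, assuming a pattern ("zzz", chosen here for illustration) that matches nothing; the logical-mask alternative at the end is one possible workaround and not something proposed in this message:

```r
x <- c("abcdef", "defabc", "qwerty")

hits <- grep("zzz", x)   # no element matches, so this is integer(0)
dropped <- x[-hits]      # -integer(0) is still integer(0): selects nothing

# Workaround sketch: build a logical mask instead of negating indices
keep <- !(seq_along(x) %in% hits)
kept <- x[keep]

print(dropped)
print(kept)
```

The mask has the same length as x whether or not anything matched, which is why it does not need a special case.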

vQ

Gabor Grothendieck

Sun, Jan 18, 2009 11:28 AM #

Try this:

# indexes
setdiff(seq_along(x), grep("abc", x))

# values
setdiff(x, grep("abc", x, value = TRUE))

Another possibility is:

z <- "abc"
x0 <- c(x, z) # to handle no match case
x0[- grep(z, x0)] # values


Eric Archer

Sun, Jan 18, 2009 11:36 AM #

Roland,

I think you were almost there with your first example.  How about using:

 > x <- c("abcdef", "defabc", "qwerty")
 > y <- grep(pattern="abc", x=x)
 > z.char <- x[-y]
 > z.index <- (1:length(x))[-y]
 >
 > z.char
[1] "qwerty"
 > z.index
[1] 3

Cheers,
eric


Eric Archer, Ph.D.
Southwest Fisheries Science Center
8604 La Jolla Shores Dr.
La Jolla, CA 92037
858-546-7121 (work)
858-546-7003 (FAX)

ETP Cetacean Assessment Program: http://swfsc.noaa.gov/prd-etp.aspx
Population ID Program: http://swfsc.noaa.gov/prd-popid.aspx



"Innocence about Science is the worst crime today."
    - Sir Charles Percy Snow

"Lighthouses are more helpful than churches."
    - Benjamin Franklin

    "...but I'll take a GPS over either one."
        - John C. "Craig" George

Rau, Roland

Sun, Jan 18, 2009 11:37 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090118/037227f0/attachment-0001.pl>

Gabor Grothendieck

Sun, Jan 18, 2009 11:54 AM #

Try this:

grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)
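
A hand-built pattern of this kind handles the three example strings, but it looks fragile: a string such as "aabc" gets through, because the a[^b] branch consumes "aa" and the remaining "bc" is then matched character by character. A sketch of an alternative, assuming PCRE (perl = TRUE) so that negative lookahead is available; the test vector y is made up for illustration:

```r
y <- c("abcdef", "defabc", "qwerty", "aabc")  # "aabc" added as a tricky case

# Hand-built pattern from the message above
hand_built <- grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", y, perl = TRUE)

# Negative lookahead: succeed at the start only if "abc" occurs nowhere after it
lookahead <- grep("^(?!.*abc)", y, perl = TRUE)

print(hand_built)  # includes 4, although "aabc" contains "abc"
print(lookahead)
```

If the strings may contain newlines, prefixing the lookahead pattern with (?s) would be needed so that . can cross them.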

On Sun, Jan 18, 2009 at 2:37 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:

> Thank you very much to all of you for your fast and excellent help.
> Since the "-grep(...)" solution seems to be favored by most of the answers,
> I just wonder if there is really no regular expression which does the job?!?
>
> Thanks again,
> Roland




Wacek Kusnierczyk

Sun, Jan 18, 2009 11:58 AM #

Gabor Grothendieck wrote:

on quick testing, these two and the if-based version have comparable
runtime, with a minor win for the last one, and if the input is moderate
this makes no real difference.

however, the second solution above is likely to fail if the pattern is
more complex, e.g., contains a character class or a wildcard:

strings = c("xyz")
pattern = "a[a-z]"
strings[-grep(pattern, c(strings, pattern))]
# character(0)


vQ

Gabor Grothendieck

Sun, Jan 18, 2009 12:02 PM #

In that case just add fixed = TRUE


Wacek Kusnierczyk

Sun, Jan 18, 2009 12:23 PM #

Gabor Grothendieck wrote:

in general, if you want a complex pattern, you don't use 'fixed', and
then again you risk incorrect (well, correct for r, but not for the
problem) result in case no input string matches the pattern.
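
Both halves of this exchange can be seen in a short sketch. It reuses the sentinel trick from earlier in the thread (appending the pattern itself to the input) and assumes the example pattern "a[a-z]":

```r
strings <- c("abc", "xyz")
pattern <- "a[a-z]"
x0 <- c(strings, pattern)   # sentinel appended to guarantee a match

# As a regex, the pattern matches "abc" but not its own text,
# so the sentinel survives into the result.
as_regex <- x0[-grep(pattern, x0)]

# With fixed = TRUE only the sentinel matches, so "abc" is no longer removed.
as_fixed <- x0[-grep(pattern, x0, fixed = TRUE)]

print(as_regex)
print(as_fixed)
```

Neither result is the wanted "xyz" alone, which suggests the sentinel approach is only dependable for patterns that match their own literal text.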


vQ

Wacek Kusnierczyk

Sun, Jan 18, 2009 12:50 PM #

Gabor Grothendieck wrote:

... and see how cumbersome it becomes for a pattern as trivial as 'abc'. 

in perl, you typically don't invent such negative patterns, but rather
"don't match" positive patterns: instead of the match operator =~ and a
negative pattern, you use the no-match operator !~ and a positive pattern:

@strings = ("abc", "xyz");
@filtered = grep $_ !~ /abc/, @strings;

in r, one way to do the no-match is using -grep, but taking care of the
special case of no matches at all in the input vector.

in perl 5.10, you can try this:

@strings = ("abc", "xyz");
@filtered = grep $_ =~ /(abc)(*COMMIT)(*FAIL)|(*ACCEPT)/, @strings;

which works by making a string that matches the pattern fail, and any
other string succeed despite no match.

vQ

Wacek Kusnierczyk

Sun, Jan 18, 2009 1:04 PM #

Wacek Kusnierczyk wrote:

incidentally, recent pcre accepts such regexes:

# r code
ungrep = function(pattern, x, ...)
    grep(paste(pattern, "(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x,
         perl=TRUE, ...)

strings = c("abc", "xyz")
pattern = "a[a-z]"
(filtered = strings[ungrep(pattern, strings)])
# "xyz"

vQ

Wacek Kusnierczyk

Sun, Jan 18, 2009 1:18 PM #

Wacek Kusnierczyk wrote:

this was a toy example, but if you need this sort of ungrep with
patterns involving alternations, you need a fix:

ungrep("a|x", strings, value=TRUE)
# "abc"
# NOT character(0)

# fix
ungrep = function(pattern, x, ...)
    grep(paste("(?:", pattern, ")(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x,
         perl=TRUE, ...)

ungrep("a|x", strings, value=TRUE)
# character(0)
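
One thing worth checking before relying on the verb-based version: the alternation is tried from the first character, so for a string like "xabc" the (*ACCEPT) branch may succeed there before the pattern is ever reached. A sketch of a variant that sidesteps this with negative lookahead; ungrep_lookahead is a hypothetical name, and PCRE via perl = TRUE is assumed:

```r
# Hypothetical helper: indices (or values) of elements NOT matching `pattern`
ungrep_lookahead <- function(pattern, x, ...) {
  # (?s) lets "." cross newlines; (?:...) keeps alternations such as "a|x" intact
  negated <- paste0("(?s)^(?!.*(?:", pattern, "))")
  grep(negated, x, perl = TRUE, ...)
}

strings <- c("abc", "xyz", "xabc", "qrs")  # "xabc" and "qrs" added for illustration
res_class <- ungrep_lookahead("a[a-z]", strings, value = TRUE)
res_alt   <- ungrep_lookahead("a|x", strings, value = TRUE)

print(res_class)
print(res_alt)
```

The lookahead scans the whole string from the start, so the position of the unwanted match does not matter.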


vQ

Rau, Roland

Sun, Jan 18, 2009 1:35 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090118/4b00dfca/attachment-0001.pl>

Gabor Grothendieck

Sun, Jan 18, 2009 1:44 PM #

Well, that's why it was only provided when you insisted.  This is
not what regexp's are good at.

On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland <Rau at demogr.mpg.de> wrote:

> Thanks! (I have to admit, though, that I expected something simple)
>
> Thanks,
> Roland




Rolf Turner

Sun, Jan 18, 2009 2:02 PM #

On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:

It may not be what regexp's are good at, but the grep command in unix/linux
does what is required *very* simply via the ``-v'' flag.  I conjecture that
it would not be difficult to add an argument with similar impact to the
grep() function in R.

	cheers,

		Rolf Turner


Gabor Grothendieck

Sun, Jan 18, 2009 2:08 PM #

That's an entirely different point from whether regular expressions
can do it as grep -v is just another way to do it without using a regular
expression to specify the entire job.


Stavros Macrakis

Sun, Jan 18, 2009 2:32 PM #

On Sun, Jan 18, 2009 at 2:22 PM, Wacek Kusnierczyk
<Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:

Yes.

The meaning of x[V] (for an integer subscript vector V) is: ignore 0
entries, and then:

a) if !(all(V>0) | all(V<0) ) => ERROR
b) if all (V>0): length(x[V]) == length(V)
c) if all (V<0): length(x[V]) == length(x)-length(unique(V))

When length(V)==0, the preconditions are true for both (b) and (c), so
the R design has made the decision that length(x[V]) == 0 in this
case.  If you're going to have the "negative indices means exclusion"
trick, this seems like a reasonable convention.

Of course, that means that you can't in general use x[-V] (where
all(V>0)) to mean "all elements that are not in V".  However, there is
a workaround if you have an upper bound on length(x):

       x[ c(-2^30, -V) ]

This guarantees at least one negative number.
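
A short sketch of this workaround on the thread's example vector, assuming length(x) is below 2^30 so that the extra index is out of range (R ignores out-of-range negative subscripts); the non-matching pattern "zzz" is made up for illustration:

```r
x <- c("abcdef", "defabc", "qwerty")

V_none <- grep("zzz", x)   # integer(0): nothing matched
V_some <- grep("abc", x)   # 1 2

plain_none <- x[-V_none]            # empty subscript vector: selects nothing
guard_none <- x[c(-2^30, -V_none)]  # the guard index is out of range and ignored
guard_some <- x[c(-2^30, -V_some)]  # exclusion works as usual when V is non-empty

print(plain_none)
print(guard_none)
print(guard_some)
```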

           -s

Gabor Grothendieck

Sun, Jan 18, 2009 2:41 PM #

Note that the variation of this that I posted already handles that case.


Wacek Kusnierczyk

Mon, Jan 19, 2009 12:54 AM #

Rolf Turner wrote:

something like grep(..., inverse=TRUE), perhaps.

vQ

Wacek Kusnierczyk

Mon, Jan 19, 2009 1:28 AM #

Stavros Macrakis wrote:

what about numeric vectors?  r performs smart downcasting here:

x[1.1]
# same as x[1]

x[0.3]
# character(0)

what if V=NULL?

there is no error for x[V] with V=0, V=as.numeric(NA), or V=NaN.

unfortunately, false if V contains a non-integer (so it goes beyond your
discussion, but may cause problems in practice):

x[c(1, 0.5)]
# one item (if x is non-empty)

not true for cases like V=c(-1, -1.5), which again go beyond your
discussion, but may happen in practice.

interestingly, unique(c(NA, NA)) is just NA, rather than c(NA,NA).  i'd
think that if we have two non-available values, we can't be sure they're
in fact equal, but unique apparently is.  (you'd have to tell it not to
be with incomparables=NA.)

interestingly, all(V>0) & all(V<0) is TRUE for V=c().
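
The edge cases listed above can be checked in one go. A minimal sketch, assuming a three-element character vector x (the values are made up for illustration):

```r
x <- c("a", "b", "c")

same_as_first <- identical(x[1.1], x[1])  # fractional indices are truncated
zero_after_trunc <- x[0.3]                # 0.3 truncates to 0, which is dropped
mixed <- x[c(1, 0.5)]                     # one element, although length(V) is 2
neg_dup <- x[c(-1, -1.5)]                 # both truncate to -1: two elements remain

V <- c()                                  # NULL
both_true <- all(V > 0) & all(V < 0)      # all() of an empty logical is TRUE
na_unique <- unique(c(NA, NA))            # a single NA

print(mixed)
print(neg_dup)
print(both_true)
```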

i didn't say this was unreasonable, just that x[integer(0)] should,
arguably, return x.  'empty index' is not as precise an expression to be
sure that it will be obvious to everyone that integer(0) is *not* an
empty index, and less so with NULL.  what is meant, i guess, is 'empty
index expression', i.e., no index rather than empty index, and i'd
humbly suggest (risking being charged with boring pedantry) to improve tfm.


vQ

Brian Ripley

Mon, Jan 19, 2009 2:32 AM #

On Mon, 19 Jan 2009, Rolf Turner wrote:

Indeed.  I have often wondered why grep() returned indices, when a 
logical vector would seem more natural in R (and !grep(...) would have 
been all that was needed).

Looking at the code I see it does in fact compute a logical vector,
just not return it.  So adding 'invert' (the long form of -v is
--invert-match) is a job of a very few lines and I have done so for 2.9.0.
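
In current versions of R this is available as grep(..., invert = TRUE), and grepl() returns the logical vector itself. A brief sketch on the original example from this thread (the non-matching pattern "zzz" is added for illustration):

```r
x <- c("abcdef", "defabc", "qwerty")

idx  <- grep("abc", x, invert = TRUE)                # indices of non-matching elements
vals <- grep("abc", x, invert = TRUE, value = TRUE)  # the elements themselves
mask <- !grepl("abc", x)                             # logical form

# When nothing matches, every element is kept; no special case is needed
none <- grep("zzz", x, invert = TRUE, value = TRUE)

print(idx)
print(vals)
print(none)
```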

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

1 day later

Wacek Kusnierczyk

Tue, Jan 20, 2009 7:14 AM #

Prof Brian Ripley wrote:

in fact, it's simpler than that.  instead of redundantly distributing
the fix over four different lines in character.c, it's enough to ^= the
logical vector of matched/unmatched flags in just one place, on-the-fly,
close to the end of the loop over the vector of input strings.  see
attached patch.

for consistency, you might want to
- name the internal invert flag 'invert_opt' instead of 'invert';
- apply the same fix to agrep.

it's also trivial to add another argument to grep, say 'logical', which
will cause grep to return a logical vector of the same length as the
input strings vector.  see the attached patch.  note: i am novice to r
internals, and i get some mystical warnings i haven't decoded yet while
using the extended grep, but otherwise the code compiles well and grep
works as intended; you'd need to fix the cause of the warnings.

if you want the 'logical' argument, you need to decide how it interacts
with 'value'.  in the patch, 'value' set to TRUE resets 'logical' to
FALSE, with a warning.

further suggestions:  the arguments 'value' and 'logical' could be
replaced with one argument, say 'output', which would take a value from
{'indices', 'values', 'logical'}.  it might make further extensions
easier to implement and maintain.
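
The single-argument idea can be tried at user level without touching R's internals. A rough sketch, assuming a wrapper around grepl(); grep_out and its 'output' argument are hypothetical names and not part of base R or of the attached patches:

```r
# Hypothetical wrapper: one 'output' argument instead of separate flags
grep_out <- function(pattern, x,
                     output = c("indices", "values", "logical"),
                     invert = FALSE, ...) {
  output <- match.arg(output)   # validates the choice; defaults to "indices"
  hit <- grepl(pattern, x, ...)
  if (invert) hit <- !hit       # inversion is a single negation of the mask
  switch(output,
         indices = which(hit),
         values  = x[hit],
         logical = hit)
}

s <- c("abc", "bcd", "cde")
r_idx  <- grep_out("b", s)
r_vals <- grep_out("b", s, output = "values")
r_lgl  <- grep_out("b", s, output = "logical")
r_inv  <- grep_out("b", s, output = "values", invert = TRUE)

print(r_idx)
print(r_inv)
```

Because every output form is derived from the same logical mask, a further form would only need one more branch in the switch().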

attached are patches to character.c, names.c, and grep.R; if you tell me
which other files need a patch to get rid of the warnings (see below),
i'll make one.

s = c("abc", "bcd", "cde")

grep("b", s)
# 1 2

grep("b", s, value=TRUE)
# "abc" "bcd"

grep("b", s, logical=TRUE)
# TRUE TRUE FALSE

s[grep("b", s, logical=TRUE)]
# "abc" "bcd"
# Warning: stack imbalance in 'grep', 9 then 10
# Warning: stack imbalance in '.Internal', 8 then 9
# Warning: stack imbalance in '{', 6 then 7

grep("b", s, invert=TRUE)
# 3

grep("b", s, invert=TRUE, value=TRUE)
# "cde"

s[!grep("b", s, logical=TRUE)]
# "cde"
# Warning: stack imbalance in 'grep', 15 then 16
# Warning: stack imbalance in '.Internal', 14 then 15
# Warning: stack imbalance in '{', 12 then 13
# Warning: stack imbalance in '!', 6 then 7
# Warning: stack imbalance in '[', 2 then 3



vQ

Wacek Kusnierczyk

Tue, Jan 20, 2009 7:17 AM #

Wacek Kusnierczyk wrote:

forgot to add:  the patches are against the latest r-devel
(19.01.2009).  compiled and tested on 32b Ubuntu 8.04.


vQ