Parsing regular expressions differently - feature request

13 messages · John Wiedenhoeft, Gabor Grothendieck, Duncan Murdoch +1 more

#
Hi there,

I rejoiced when I realized that you can use Perl regex from within R. However, 
as the FAQ states "Some functions, particularly those involving regular 
expression matching, themselves use metacharacters, which may need to be 
escaped by the backslash mechanism. In those cases you may need a quadruple 
backslash to represent a single literal one. "

I was wondering if that is really necessary for perl=TRUE? Wouldn't it be 
possible to parse a string differently in a regex context, e.g. automatically 
insert \\ for each \ , such that you can use the perl syntax directly? For 
example, if you want to input a newline as a character, you would use \n 
anyway. At the moment one writes \\n to make it clear to R that you mean \n, 
which in turn tells PCRE that you mean newline... this is pretty annoying. How likely is it 
that you want to pass a real newline character to PCRE directly?
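
To make the two levels of escaping concrete, here is a minimal base R sketch. The strings are made up for illustration; only `nchar()` and `grepl()` are used.

```r
# Level 1: R's parser turns the source text "\\n" into two characters, \ and n.
pat <- "\\n"
nchar(pat)                                  # 2

# Level 2: the regex engine reads those two characters as "newline".
grepl(pat, "line1\nline2", perl = TRUE)     # TRUE

# A single literal backslash therefore needs four backslashes in source:
# the parser halves them to two, and the regex engine halves them again.
lit <- "\\\\"
nchar(lit)                                  # 2
grepl(lit, "C:\\temp", perl = TRUE)         # TRUE
grepl(lit, "no backslash", perl = TRUE)     # FALSE
```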

If it's anyhow possible to pass everything between " and " directly to PCRE 
without expanding it internally in R, please add this to a future version (as 
an option like noescape=TRUE perhaps?)! I would love to use R instead of Perl 
for working with regex, without having to do two levels of escape all the 
time.

Thanks,
John
#
A feature to simplify the entry of backslashes has been requested many times
and keeps coming up. It would be useful not only for regexps
but also for LaTeX and Windows path names, and I too hope that it will be
addressed.
On Sat, Nov 8, 2008 at 7:20 AM, John Wiedenhoeft <john at nurfuerspam.de> wrote:
#
On 08/11/2008 7:20 AM, John Wiedenhoeft wrote:
No, that's not possible.  At the level where the parsing takes place R 
has no idea of its eventual use, so it can't tell that some strings are 
going to be interpreted as Perl, and others not.

As Gabor mentioned, there have been various discussions of adding a new 
syntax for strings that are parsed literally, without processing any 
escapes, but no consensus on the right syntax to use.

There are currently some fragile tricks that let you avoid escapes, e.g. 
using scan() to read a line:

 > re <- scan(what="", n=1)
1: [^\\]
Read 1 item
 > re
[1] "[^\\\\]"

(I call this fragile because it works in scripts processed at console 
level, but not if you type the same thing into a function.)

So I agree, it would be nice to have new syntax to allow this.  Last 
time this came up, I argued for something like \verb in LaTeX where the 
delimiter could be specified differently in each use.  Duncan TL 
suggested triple quotes, as in Python.  I think now that triple quotes 
would be better than the particular form I suggested.
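
For readers on a recent R: R 4.0.0 and later provide raw string constants of the form r"(...)", in which backslashes are taken literally. A small sketch, assuming R >= 4.0.0 (it will not parse on older versions):

```r
# Requires R >= 4.0.0: r"(...)" is a raw string, backslashes are not escapes.
re <- r"([^\\])"
identical(re, "[^\\\\]")                  # TRUE: same string as the escaped form
nchar(re)                                 # 5

# A Perl-style pattern can then be written as it would be in Perl.
grepl(r"(\d+)", "abc123", perl = TRUE)    # TRUE
```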

Duncan Murdoch
#
On Sat, Nov 8, 2008 at 9:41 AM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
Ruby's quoting method looks quite flexible:

http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes
#
On 08/11/2008 11:03 AM, Gabor Grothendieck wrote:
Thanks, I didn't know about those.  I would have preferred Ruby's option 
to the one I made up when we last had this discussion, but it also 
suffers from the same flaw:  it won't work in Rd files.  There the % 
sign is a comment marker.  Saying that sometimes it's not just makes 
everything more complicated.

So right now I'd have to say that Python-style quotes would be my 
choice.  If you want to put '''""" into your string, you'll be stuck 
using regular quotes and escapes, but I could live with that.

Duncan Murdoch
#
On Sat, Nov 8, 2008 at 2:05 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
One could use a different character.
#
Duncan Murdoch wrote:
Here's a quick hack to achieve the impossible:

mygrep = function(pattern, text, perl=FALSE, ...) {
   if (perl) pattern = gsub("\\\\", "\\\\\\\\", pattern)
   grep(pattern, text, perl=perl, ...)
}

(text = "lemme \\ it")
# [1] "lemme \\ it"

nchar(text)
# [1] 10

(pattern = "\\")
# [1] "\\"
nchar(pattern)
# [1] 1

grep(pattern, text, perl=TRUE)
# can't go, impossible!

mygrep(pattern, text, perl=TRUE, value=TRUE)
# [1] "lemme \\ it"
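
A minimal sketch of what the gsub() call inside mygrep does, using only base R. The helper name double_backslashes is made up for illustration.

```r
# Hypothetical helper: turn every single backslash into two, so that the
# regex engine sees an escaped (literal) backslash.
double_backslashes <- function(pattern) gsub("\\\\", "\\\\\\\\", pattern)

p <- "\\"                                # one backslash character
q <- double_backslashes(p)               # two backslash characters
nchar(p)                                 # 1
nchar(q)                                 # 2
grepl(q, "lemme \\ it", perl = TRUE)     # TRUE
```

The price is that every backslash in the pattern becomes literal, so a pattern such as "\\d" or "\\n" no longer acts as a regex escape once it has been doubled.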

vQ
#
On 08/11/2008 3:16 PM, Wacek Kusnierczyk wrote:
That might solve John's problem, but I doubt it.  As far as I can see it 
won't handle \L, for example.

Duncan Murdoch
#
Duncan Murdoch wrote:
well, it was not supposed to.  it addresses the need for doubling
backslashes when a backslash character is an element of the regex. 

foo = "foo\\n\n"

grep("\n", foo, perl=TRUE, value=TRUE)
mygrep("\n", foo, perl=TRUE, value=TRUE)
# both match the newline

grep("\\n", foo, perl=TRUE, value=TRUE)
mygrep("\\n", foo, perl=TRUE, value=TRUE)
# both match: grep matches the newline, mygrep matches the backslash-n

bar = "bar\n"

grep("\n", bar, perl=TRUE, value=TRUE)
mygrep("\n", bar, perl=TRUE, value=TRUE)
# both match the newline

grep("\\n", bar, perl=TRUE, value=TRUE)
mygrep("\\n", bar, perl=TRUE, value=TRUE)
# counterintuitively, grep matches (intuitively, it should match
# backslash-n, not a newline, but there's just a newline in bar) -- i do
# know why it matches, but i'm pretty sure for many of those who do it's
# an inconvenient detail, and for those who don't it's a confusing annoyance

zee = "zee\\"

grep("\\", zee, perl=TRUE, value=TRUE)
mygrep("\\", zee, perl=TRUE, value=TRUE)
# grep fails, needs "\\\\"
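
If the goal is only to find a literal backslash, base R's fixed = TRUE skips regex interpretation altogether, so a single backslash character in the pattern is enough. A small sketch of that alternative (it is not what mygrep does):

```r
zee <- "zee\\"

# fixed = TRUE treats the pattern as a plain string, not as a regex.
grepl("\\", zee, fixed = TRUE)      # TRUE

# The regex route needs the backslash escaped for the engine as well.
grepl("\\\\", zee, perl = TRUE)     # TRUE
```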

conclusion?  i'd opt for mygrep in my own code; i guessed this was what
john wanted, therefore the post.

vQ
#
On 08/11/2008 5:43 PM, Wacek Kusnierczyk wrote:
\L could be an element of a regex in Perl.

Duncan Murdoch
#
Duncan Murdoch wrote:
indeed;  but that's not something my PCRE will happily accept (try
grep("\\LA", "a", perl=TRUE), which may possibly tell you just that). 
if not \L, what's your next best argument against mygrep?

vQ
#
Wacek Kusnierczyk wrote:
here's another example of what could be considered r grep's idiosyncrasy:

grep("\\n", "\n", perl=TRUE)
# matches

grep("\\n", "\\n", perl=TRUE)
# matches

with everything else equal, "\\n" should match *either* newline *or*
backslash-n, no?

vQ
#
Wacek Kusnierczyk wrote:
you should perhaps specify what you mean by 'could be an element of a
regular expression'.  there is a difference between a regular expression
and a string specifying it.

in /\LA/, i'd say the string between the slashes contains three
characters, but the regex contains just one;  there is no \L in the
regex, and no backslash either. 

in /\\LA/, i'd say the string contains four characters, but the regex
just three; there is a backslash there, but no \L.

just try qr/\L/ (that's an empty pattern), qr/\LA/ (that's a
one-character pattern equivalent to qr/a/), and qr/\\LA/. 

in a sense, \L is just a macro used in constructing regexes, but it has
no place in a regex.  in this view, "\LA" and "a" are two distinct
strings specifying the same regex (try qr/\LA/ eq qr/a/).

vQ