Hi there,

I rejoiced when I realized that you can use Perl regex from within R. However, as the FAQ states: "Some functions, particularly those involving regular expression matching, themselves use metacharacters, which may need to be escaped by the backslash mechanism. In those cases you may need a quadruple backslash to represent a single literal one."

I was wondering if that is really necessary for perl=TRUE. Wouldn't it be possible to parse a string differently in a regex context, e.g. automatically insert \\ for each \, so that you can use the Perl syntax directly? For example, if you want to input a newline as a character, you would use \n anyway. At the moment one writes \\n to make it clear to R that you mean \n, which in turn makes it clear to the regex engine that you mean newline. This is pretty annoying. How likely is it that you want to pass a real newline character to PCRE directly?

If it is somehow possible to pass everything between " and " directly to PCRE without expanding it internally in R, please add this to a future version (perhaps as an option like noescape=TRUE)! I would love to use R instead of Perl for working with regex, without having to do two levels of escaping all the time.

Thanks,
John
Parsing regular expressions differently - feature request
13 messages · John Wiedenhoeft, Gabor Grothendieck, Duncan Murdoch +1 more
Some feature to simplify the entry of backslashes has been requested many times and keeps coming up. It would be useful not only for regexps but also for LaTeX and Windows path names, and I too hope that it will be addressed.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On 08/11/2008 7:20 AM, John Wiedenhoeft wrote:
Hi there, I rejoiced when I realized that you can use Perl regex from within R. However, as the FAQ states "Some functions, particularly those involving regular expression matching, themselves use metacharacters, which may need to be escaped by the backslash mechanism. In those cases you may need a quadruple backslash to represent a single literal one. " I was wondering if that is really necessary for perl=TRUE? wouldn't it be possible to parse a string differently in a regex context, e.g. automatically insert \\ for each \ , such that you can use the perl syntax directly? For example, if you want to input a newline as a character, you would use \n anyway. At the moment one says \\n to make it clear to R that you mean \n to make clear that you mean newline... this is pretty annoying. How likely is it that you want to pass a real newline character to PCRE directly?
No, that's not possible. At the level where the parsing takes place, R has no idea of a string's eventual use, so it can't tell that some strings are going to be interpreted as Perl regular expressions and others not.

As Gabor mentioned, there have been various discussions of adding a new syntax for strings that are parsed literally, without processing any escapes, but no consensus on the right syntax to use. There are currently some fragile tricks that let you avoid escapes, e.g. using scan() to read a line:

> re <- scan(what="", n=1)
1: [^\\]
Read 1 item
> re
[1] "[^\\\\]"

(I call this fragile because it works in scripts processed at console level, but not if you type the same thing into a function.)

So I agree, it would be nice to have new syntax to allow this. Last time this came up, I argued for something like \verb in LaTeX, where the delimiter could be specified differently in each use. Duncan TL suggested triple quotes, as in Python. I now think that triple quotes would be better than the particular form I suggested.

Duncan Murdoch
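To make the point about parse time concrete, here is a small sketch (not from the thread; the variable names are only illustrative). It shows that escapes in a string literal are resolved when the literal is parsed, so a function such as grep() only ever receives the resulting characters and has no way of knowing how they were typed.

```r
# The parser resolves escapes in string literals before any function runs.
one_char  <- "\n"    # a single newline character
two_chars <- "\\n"   # a backslash followed by the letter n

nchar(one_char)      # 1
nchar(two_chars)     # 2

# print() re-escapes for display; cat() writes the characters themselves.
print(two_chars)     # [1] "\\n"
cat(two_chars, "\n") # writes a backslash, an n, a space and a line break

# A function only sees the parsed characters: building the same string
# another way gives an identical value.
same <- identical(two_chars, paste0("\\", "n"))
same                 # TRUE
```

This is why an argument like noescape=TRUE on grep() could not help: by the time the argument is evaluated, the original spelling of the literal is already gone.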
Ruby's quoting method looks quite flexible: http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes
On 08/11/2008 11:03 AM, Gabor Grothendieck wrote:
Ruby's quoting method looks quite flexible: http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes
Thanks, I didn't know about those. I would have preferred Ruby's option to the one I made up when we last had this discussion, but it suffers from the same flaw: it won't work in Rd files, where the % sign is a comment marker. Saying that it sometimes isn't one just makes everything more complicated.

So right now I'd have to say that Python-style quotes would be my choice. If you want to put '''""" into your string, you'll be stuck using regular quotes and escapes, but I could live with that.

Duncan Murdoch
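For readers on a recent R: releases from 4.0.0 onward include raw string constants, written r"(...)", which come close to the literal syntax being discussed here. The following is a minimal sketch assuming such a version; the variable names and the sample text are made up for illustration.

```r
# Raw string constants (assumes R >= 4.0.0): backslashes are taken literally.
pattern_raw     <- r"(\d+\.\d+)"
pattern_escaped <- "\\d+\\.\\d+"

# Both spellings give the same characters.
same_pattern <- identical(pattern_raw, pattern_escaped)
same_pattern   # TRUE

# The pattern reaches PCRE exactly as typed between r"( and )".
subject <- "version 2.8 of R"
found <- regmatches(subject, regexpr(pattern_raw, subject, perl = TRUE))
found          # "2.8"
```

As with the triple-quote idea, a string that must contain the closing delimiter itself still needs either a different delimiter or ordinary quotes and escapes.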
One could use a different character.
Duncan Murdoch wrote:
No, that's not possible. At the level where the parsing takes place R has no idea of its eventual use, so it can't tell that some strings are going to be interpreted as Perl, and others not.
Here's a quick hack to achieve the impossible:
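# Wrapper around grep(): for perl=TRUE, double every backslash in the
# pattern so that each one reaches PCRE as a literal backslash.
mygrep = function(pattern, text, perl=FALSE, ...) {
    if (perl) pattern = gsub("\\\\", "\\\\\\\\", pattern)
    grep(pattern, text, perl=perl, ...)
}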
(text = "lemme \\ it")
# [1] "lemme \\ it"
nchar(text)
# [1] 10
(pattern = "\\")
# [1] "\\"
nchar(pattern)
# [1] 1
grep(pattern, text, perl=TRUE)
# can't go, impossible!
mygrep(pattern, text, perl=TRUE, value=TRUE)
# [1] "lemme \\ it"
vQ
On 08/11/2008 3:16 PM, Wacek Kusnierczyk wrote:
Here's a quick hack to achieve the impossible:
That might solve John's problem, but I doubt it. As far as I can see it won't handle \L, for example. Duncan Murdoch
Duncan Murdoch wrote:
That might solve John's problem, but I doubt it. As far as I can see it won't handle \L, for example.
well, it was not supposed to. it addresses the need for doubling
backslashes when a backslash character is an element of the regex.
foo = "foo\\n\n"
grep("\n", foo, perl=TRUE, value=TRUE)
mygrep("\n", foo, perl=TRUE, value=TRUE)
# both match the newline
grep("\\n", foo, perl=TRUE, value=TRUE)
mygrep("\\n", foo, perl=TRUE, value=TRUE)
# both match (guess what)
bar = "bar\n"
grep("\n", bar, perl=TRUE, value=TRUE)
mygrep("\n", bar, perl=TRUE, value=TRUE)
# both match the newline
grep("\\n", bar, perl=TRUE, value=TRUE)
mygrep("\\n", bar, perl=TRUE, value=TRUE)
# counterintuitively, grep matches (intuitively, it should match
# backslash-n, not a newline, but there's just a newline in bar) -- i do
# know why it matches, but i'm pretty sure for many of those who do it's
# an inconvenient detail, and for those who don't it's a confusing annoyance
zee = "zee\\"
grep("\\", zee, perl=TRUE, value=TRUE)
mygrep("\\", zee, perl=TRUE, value=TRUE)
# grep fails, needs "\\\\"
conclusion? i'd opt for mygrep in my own code; i guessed this was what
john wanted, therefore the post.
vQ
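The trade-off behind this choice can be sketched as follows. The block repeats mygrep as posted so that it runs on its own; the test vector x and the result names are assumptions added for illustration. Once every backslash is doubled, each one reaches PCRE as a literal character, so regex escapes such as \d can no longer be expressed through the wrapper.

```r
# mygrep as posted: double every backslash before calling grep().
mygrep <- function(pattern, text, perl = FALSE, ...) {
  if (perl) pattern <- gsub("\\\\", "\\\\\\\\", pattern)
  grep(pattern, text, perl = perl, ...)
}

# Two subjects: "a1" and the three characters a, backslash, d.
x <- c("a1", "a\\d")

# Plain grep(): \d is the regex escape for a digit, so "a1" matches.
plain <- grep("a\\d", x, perl = TRUE, value = TRUE)
plain     # "a1"

# mygrep(): the backslash is now literal, so only the subject that really
# contains backslash-d matches; the digit class is out of reach.
doubled <- mygrep("a\\d", x, perl = TRUE, value = TRUE)
doubled   # "a\\d"
```

So the wrapper suits patterns where a backslash always means a backslash, and plain grep() remains necessary whenever a regex escape is wanted.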
On 08/11/2008 5:43 PM, Wacek Kusnierczyk wrote:
well, it was not supposed to. it addresses the need for doubling backslashes when a backslash character is an element of the regex.
\L could be an element of a regex in Perl. Duncan Murdoch
Duncan Murdoch wrote:
\L could be an element of a regex in Perl. Duncan Murdoch
indeed; but that's not something my PCRE will happily accept (try
grep("\\LA", "a", perl=TRUE), which may possibly tell you just that).
if not \L, what's your next best argument against mygrep?
vQ
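A hedged sketch of where \L does and does not apply in R. The first part captures whatever grep() does with \L in a pattern rather than assuming a particular message, since that depends on the PCRE build. The second part relies on the documented behaviour of sub() and gsub() with perl=TRUE, where \L, \U and \E are accepted in the replacement string. The sample text and variable names are illustrative only.

```r
# In a pattern: record the outcome instead of assuming it. With the PCRE
# builds I know of this is an error, but the message text varies.
outcome <- tryCatch(
  suppressWarnings(grep("\\LA", "a", perl = TRUE)),
  error = function(e) conditionMessage(e)
)
print(outcome)

# In a replacement: with perl = TRUE, \L lower-cases the rest of the
# replacement (here, the captured word).
lowered <- gsub("(\\w+)", "\\L\\1", "HELLO World", perl = TRUE)
lowered   # "hello world"
```

This fits the view that \L is a string-construction device rather than part of the regular expression itself: R exposes it only on the replacement side.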
here's another example of what could be considered r grep's idiosyncrasy:
grep("\\n", "\n", perl=TRUE)
# matches
grep("\\n", "\\n", perl=TRUE)
# matches
with everything else equal, "\\n" should match *either* newline *or*
backslash-n, no?
vQ
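The two layers can be separated by testing each pattern against both subjects at once. This is a sketch under the assumption of a current R with PCRE, where the two-character pattern backslash-n is read by the regex engine as the escape for a newline. On that assumption a subject consisting of a backslash and an n contains no newline and should not match it; results on the R version used in this thread may have differed. The names subjects, as_escape, as_literal and as_newline are made up for the example.

```r
subjects <- c(newline = "\n", backslash_n = "\\n")

# Pattern of two characters (backslash, n): PCRE reads it as the escape
# for a newline.
as_escape <- grepl("\\n", subjects, perl = TRUE)
as_escape    # TRUE FALSE

# Pattern of three characters (backslash, backslash, n): PCRE reads it as
# a literal backslash followed by n.
as_literal <- grepl("\\\\n", subjects, perl = TRUE)
as_literal   # FALSE TRUE

# Pattern of one character (an actual newline): it matches itself.
as_newline <- grepl("\n", subjects, perl = TRUE)
as_newline   # TRUE FALSE
```

Read this way, each pattern string means exactly one thing to the regex engine; the ambiguity is only in how many backslashes the R literal needs to produce it.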
Wacek Kusnierczyk wrote:
Duncan Murdoch wrote:
\L could be an element of a regex in Perl.
you should perhaps specify what you mean by 'could be an element of a regular expression'. there is a difference between a regular expression and a string specifying it. in /\LA/, i'd say the string between the slashes contains three characters, but the regex contains just one; there is no \L in the regex, and no backslash either. in /\\LA/, i'd say the string contains four characters, but the regex just three; there is a backslash there, but no \L.

just try qr/\L/ (that's an empty pattern), qr/\LA/ (that's a one-character pattern equivalent to qr/a/), and qr/\\LA/. in a sense, \L is just a macro used in constructing regexes, but it has no place in a regex. in this view, "\LA" and "a" are two distinct strings specifying the same regex (try qr/\LA/ eq qr/a/).

vQ