can someone show me how to use a regular expression to break the string at the bottom up into its three components:
(-0.791,-0.263]
(-38,-1.24]
(0.96,2.43]
I tried to use strsplit because of my regexpitis (it's not curable. i've been to many doctors all over NYC. they tell me there's no cure) but it doesn't work because there are also dots inside the brackets. Thanks.
(-0.791,-0.263].(-38,-1.24].(0.96,2.43]
regular expression question
7 messages · Mark Leeds, Berwin A Turlach, Wacek Kusnierczyk +3 more
G'day Mark,

On Tue, 03 Mar 2009 00:16:34 -0600 (CST), markleeds at verizon.net wrote:
> can someone show me how to use a regular expression to break the string at the bottom up into its three components [...]
Probably you will get better answers from regexp experts, but here we
go:
The problem seems to be that strsplit() throws away the part that is
matched when deciding where to split. Thus, I guess the aim would be
to replace the `.' on which you want to split by something else and
then use strsplit(). For example you could do:
R> str <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
R> (uu <- gsub("(\\([^]]*\\])(\\.)", "\\1RIsGreat", str))
[1] "(-0.791,-0.263]RIsGreat(-38,-1.24]RIsGreat(0.96,2.43]"
R> strsplit(uu, "RIsGreat")
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
Though the following works too.
R> (uu <- gsub("(\\([^]]*\\])(\\.)", "\\1?", str))
[1] "(-0.791,-0.263]?(-38,-1.24]?(0.96,2.43]"
R> strsplit(uu, "\\?")
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
To explain the gsub() command: it says look for an opening round
bracket ("\\("), followed by any number of characters other than a
closing square bracket ("[^]]*"), followed by a closing square bracket
("\\]"), which is followed by a dot ("\\."). Call the part that is made
up of the first three pieces group 1 and the dot group 2 (that's what
the round open/close brackets in the regexp do):
(\\([^]]*\\])(\\.)
^^^^^^^^^^^^^-----
   group 1   group 2
Hopefully that explains the regexp used in the first part, the second
part then says replace this pattern by repeating the first group
("\\1") and by replacing the second group with "RIsGreat" or,
respectively "?".
HTH.
Cheers,
Berwin
=========================== Full address =============================
Berwin A Turlach Tel.: +65 6516 4416 (secr)
Dept of Statistics and Applied Probability +65 6516 6650 (self)
Faculty of Science FAX : +65 6872 3919
National University of Singapore
6 Science Drive 2, Blk S16, Level 7 e-mail: statba at nus.edu.sg
Singapore 117546 http://www.stat.nus.edu.sg/~statba
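
As a rough sketch of the replace-then-split idea described above, the two steps can be wrapped in a small helper. The function name split_intervals() and the "|" marker are assumptions made for this example, not part of the original post; using fixed = TRUE in strsplit() means the marker does not have to be escaped.

```r
# Sketch only: split_intervals() and the default marker are hypothetical.
split_intervals <- function(s, marker = "|") {
  # Guard: the marker must not already occur in the input.
  stopifnot(!grepl(marker, s, fixed = TRUE))
  # Keep group 1 (the bracketed interval) and swap the trailing dot for the marker.
  marked <- gsub("(\\([^]]*\\])\\.", paste0("\\1", marker), s)
  # fixed = TRUE treats the marker as a literal string, not as a regexp.
  strsplit(marked, marker, fixed = TRUE)[[1]]
}

input <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
parts <- split_intervals(input)
print(parts)
```

The guard is there because a marker that already occurs in the data would produce extra splits.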
here's one way to get a matrix of numeric values:
text = "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
values = matrix(ncol=2, byrow=TRUE,
    as.numeric(
        grep(pattern='.', value=TRUE,
            x=strsplit(x=text, split=']\\.\\(|\\(|]|,')[[1]])))
modify any of the steps according to your needs.
vQ
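
For readers who want to modify the steps, here is the same pipeline unrolled into intermediate results. This is a sketch rather than the poster's exact code: nzchar() stands in for the grep(pattern='.') filter, and the column names "lower" and "upper" are assumptions added for readability.

```r
text <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Step 1: split on "].(", "(", "]" or ","; the leading "(" leaves an empty first token.
tokens <- strsplit(text, "]\\.\\(|\\(|]|,")[[1]]

# Step 2: drop empty tokens (nzchar() replaces the grep(pattern = ".") filter).
tokens <- tokens[nzchar(tokens)]

# Step 3: convert to numbers and fill a two-column matrix row by row.
values <- matrix(as.numeric(tokens), ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("lower", "upper")))  # names are assumed
print(values)
```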
Here is another way with the gsubfn package:
> require( gsubfn )
> strapply( text, "\\(.*?,.*?]", perl = T )[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
Note that gregexpr would also help you here:
> g <- gregexpr( "\\(.*?,.*?]", text, perl = T )[[1]]
> g
[1] 1 17 29
attr(,"match.length")
[1] 15 11 11
But there is always the missing part of extracting the match from the result of (g)regexpr:
> substring( text, g, g + attr(g, "match.length" ) - 1 )
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
Romain

Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
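
The extraction step can also be handled by regmatches() in base R, which postdates this thread and is only available in later R versions. A minimal sketch, assuming such a version and reusing the same pattern:

```r
text <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# gregexpr() finds all matches; regmatches() pulls out the matched substrings,
# replacing the manual substring() arithmetic on "match.length".
m <- gregexpr("\\(.*?,.*?]", text, perl = TRUE)
pieces <- regmatches(text, m)[[1]]
print(pieces)
```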
Here are two solutions using the gsubfn package. strapply works by matching what you want rather than what you don't want, which may make it easier in this case. The two solutions are the same except that we use \\ escapes in the first and [ ... ] in the second, i.e. \\( has the same effect as [(]. In each case we first match the (, then a sequence of characters that are not ], and finally the terminating ].
library(gsubfn)
x <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
strapply(x, "\\([^]]+[]]")[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
strapply(x, "[(][^]]+[]]")[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
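
If gsubfn is not installed, the equivalence of the two patterns can be checked with base R alone. This is only a sketch: gregexpr() with regmatches() (from later R versions) stands in for strapply, and extract_all() is a hypothetical helper name.

```r
x <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Hypothetical helper: return every substring of s that matches pattern.
extract_all <- function(pattern, s) regmatches(s, gregexpr(pattern, s))[[1]]

escaped   <- extract_all("\\([^]]+[]]", x)   # "(" written with a backslash escape
bracketed <- extract_all("[(][^]]+[]]", x)   # "(" written as a character class
print(identical(escaped, bracketed))
```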
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
1 day later
Here is another approach that still uses strsplit, if you want to stay with that:
tmp <- '(-0.791,-0.263].(-38,-1.24].(0.96,2.43]'
strsplit(tmp, '\\.(?=\\()', perl=TRUE)
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]" "(0.96,2.43]"
This uses the Perl 'look-ahead' indicator to say only match on a period that is followed by a '(', but don't include the '(' in the match.
Hope this helps,
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
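
To see what the look-ahead changes, here is a small sketch that compares a plain split on every dot with the look-ahead version. The variable names naive and ahead are chosen for this example only.

```r
tmp <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Splitting on every literal dot also cuts through the decimal points.
naive <- strsplit(tmp, "\\.")[[1]]

# The look-ahead only accepts a dot that is immediately followed by "(";
# the "(" is checked but not consumed, so it stays in the next piece.
ahead <- strsplit(tmp, "\\.(?=\\()", perl = TRUE)[[1]]

print(length(naive))
print(ahead)
```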
right; you could extend this pattern to split the string at every dot
that has no digit on either side, for example:
strsplit(tmp, '(?<!\\d)\\.(?!\\d)', perl=TRUE)
of course, this only works while the separating dots never touch a digit;
a number without a leading zero, e.g., .11, is still left intact, because
its dot is followed by a digit
vQ
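
A quick way to probe the look-around pattern is to run it on an input that contains a number without a leading zero. The string bad below is a hypothetical example, not data from the thread. As written, the pattern only matches a dot with no digit on either side, so the dot in .11 should not be treated as a separator.

```r
pat <- "(?<!\\d)\\.(?!\\d)"

ok  <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
bad <- "(-0.791,-0.263].(.11,2.43]"   # hypothetical input with no leading zero

res_ok  <- strsplit(ok,  pat, perl = TRUE)[[1]]
res_bad <- strsplit(bad, pat, perl = TRUE)[[1]]

print(res_ok)
print(res_bad)
```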