
regular expression question

7 messages · Mark Leeds, Berwin A Turlach, Wacek Kusnierczyk +3 more

#
can someone show me how to use a regular expression to break the string 
at the bottom up into its three components :

(-0.791,-0.263]
(-38,-1.24]
(0.96,2.43]

I tried to use strsplit because of my regexpitis (it's not curable. I've 
been to many doctors all over NYC. They tell me there's no cure) but 
it doesn't work because there are also dots inside the brackets. Thanks.

(-0.791,-0.263].(-38,-1.24].(0.96,2.43]
#
G'day Mark,

On Tue, 03 Mar 2009 00:16:34 -0600 (CST)
markleeds at verizon.net wrote:

            
Probably you will get better answers from regexp experts, but here we
go:

The problem seems to be that strsplit() throws away the part that is
matched when deciding where to split.  Thus, I guess the aim would be
to replace the `.' on which you want to split by something else and
then use strsplit().  For example you could do:

R> str <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
R> (uu <- gsub("(\\([^]]*\\])(\\.)", "\\1RIsGreat", str))
[1] "(-0.791,-0.263]RIsGreat(-38,-1.24]RIsGreat(0.96,2.43]"
R> strsplit(uu, "RIsGreat")
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"    

Though the following works too.

R> (uu <- gsub("(\\([^]]*\\])(\\.)", "\\1?", str))
[1] "(-0.791,-0.263]?(-38,-1.24]?(0.96,2.43]"
R> strsplit(uu, "\\?")
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"    

To explain the gsub() command: it says look for an opening round
bracket ("\\("), followed by anything but a closing square bracket
("[^]]*"), followed by a closing square bracket ("\\]"), which is followed
by a dot ("\\.").  Call the part that is made up of the first three
pieces group 1 and the dot group 2 (that's what the unescaped open/close
round brackets in the regexp do):
(\\([^]]*\\])(\\.)
^^^^^^^^^^^^^-----
group1       group2

Hopefully that explains the regexp used in the first part, the second
part then says replace this pattern by repeating the first group
("\\1") and by replacing the second group with "RIsGreat" or,
respectively "?".
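
A minimal sketch of the same placeholder idea, assuming the placeholder character ("?" here) never occurs in the data. It uses only one capture group, since the dot itself does not need to be referred to in the replacement, and splits with fixed = TRUE so the placeholder needs no escaping. The names marked and parts are just for illustration.

```r
# Sketch only: placeholder approach with a single capture group.
str <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Keep each bracketed interval (group 1) and swap the dot after it for "?".
marked <- gsub("(\\([^]]*\\])\\.", "\\1?", str)

# fixed = TRUE treats "?" as a literal character rather than a regexp.
parts <- strsplit(marked, "?", fixed = TRUE)[[1]]
print(parts)
```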

HTH.

Cheers,

	Berwin

=========================== Full address =============================
Berwin A Turlach                            Tel.: +65 6516 4416 (secr)
Dept of Statistics and Applied Probability        +65 6516 6650 (self)
Faculty of Science                          FAX : +65 6872 3919       
National University of Singapore     
6 Science Drive 2, Blk S16, Level 7          e-mail: statba at nus.edu.sg
Singapore 117546                    http://www.stat.nus.edu.sg/~statba
#
markleeds at verizon.net wrote:
here's one way to get a matrix of numeric values:
   
    text = "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
    values = matrix(ncol=2, byrow=TRUE,
        as.numeric(
           grep(pattern='.', value=TRUE,
              x=strsplit(x=text, split=']\\.\\(|\\(|]|,')[[1]])))

modify any of the steps according to your needs.
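
For readers who want to see the individual steps, the nested call above can be unrolled roughly as follows. This is a sketch, not the poster's code: the intermediate names (pieces, tokens) are made up, and nzchar() stands in for the grep(pattern='.', value=TRUE) filter used to drop empty strings.

```r
# Sketch only: the same pipeline, one step at a time.
text <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Step 1: split on "].(", "(", "]" or ","; the leading "(" leaves an
# empty string at the front of the result.
pieces <- strsplit(text, split = "]\\.\\(|\\(|]|,")[[1]]

# Step 2: drop the empty strings.
tokens <- pieces[nzchar(pieces)]

# Step 3: convert to numbers and fill a two-column matrix row by row,
# so each row holds the lower and upper end of one interval.
values <- matrix(as.numeric(tokens), ncol = 2, byrow = TRUE)
print(values)
```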

vQ
#
Wacek Kusnierczyk wrote:
Here is another way with the gsubfn package:

 > require( gsubfn )
 > strapply( text, "\\(.*?,.*?]", perl = T )[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"

Note that gregexpr would also help you here:

 > g <- gregexpr( "\\(.*?,.*?]", text, perl = T )[[1]]
 > g
[1]  1 17 29
attr(,"match.length")
[1] 15 11 11

But there is always the missing part of extracting the match from the 
result of (g)regexpr

 > substring( text, g, g + attr(g, "match.length" ) - 1 )
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"
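
In versions of R newer than the one current when this thread was written, base R also has regmatches(), which does the extraction step directly. A short sketch, assuming such a version of R is available:

```r
# Sketch only: regmatches() pulls the matched substrings out of the
# result of gregexpr(), replacing the manual substring() arithmetic.
text <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
m <- gregexpr("\\(.*?,.*?]", text, perl = TRUE)
intervals <- regmatches(text, m)[[1]]
print(intervals)
```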

Romain
#
Here are two solutions using the gsubfn package.
strapply works by matching what you want
rather than what you don't want, which may make
it easier in this case.  The two solutions are the
same except we use \\ escapes in the first and
[ ... ] in the second, i.e. \\( has the same effect
as [(].   In each case we first match the ( then
a sequence of characters that is not ] and finally
we match the terminating ].
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"
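
The code that produced the two output lines above did not survive in this copy of the message. The sketch below is an assumption reconstructed from the description: two equivalent patterns, one with \\ escapes and one with [ ... ] classes. To keep it free of package dependencies it uses base R's gregexpr() and regmatches() in place of strapply; with gsubfn loaded, the same patterns could presumably be passed to strapply(text, pattern)[[1]].

```r
# Sketch only: patterns inferred from the description, not the original code.
text <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"

# Variant 1: backslash escapes for the literal "(" and "]".
with_escapes <- regmatches(text, gregexpr("\\([^]]*\\]", text))[[1]]

# Variant 2: single-character classes, [(] and []], instead of escapes.
with_classes <- regmatches(text, gregexpr("[(][^]]*[]]", text))[[1]]

print(with_escapes)
print(with_classes)
```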
On Tue, Mar 3, 2009 at 1:16 AM, <markleeds at verizon.net> wrote:
1 day later
#
Here is another approach that still uses strsplit if you want to stay with that:
[[1]]
[1] "(-0.791,-0.263]" "(-38,-1.24]"     "(0.96,2.43]"   

This uses the Perl 'look-ahead' indicator to say only match on a period that is followed by a '(', but don't include the '(' in the match.
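
The call itself is missing from this copy of the message. A sketch of what a look-ahead split of this kind can look like is given below; the pattern is an assumption based on the description, and the variable name tmp is borrowed from the follow-up in this thread.

```r
# Sketch only: split on a period that is immediately followed by "(".
# (?=\\() is a look-ahead: it checks for the "(" without consuming it,
# so the bracket stays attached to the next piece.
tmp <- "(-0.791,-0.263].(-38,-1.24].(0.96,2.43]"
parts <- strsplit(tmp, "\\.(?=\\()", perl = TRUE)[[1]]
print(parts)
```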

Hope this helps,
#
Greg Snow wrote:
right;  you could extend this pattern to split the string by every dot
that does not separate two digits, for example:
   
    strsplit(tmp, '(?<!\\d)\\.(?!\\d)', perl=TRUE)

of course, this fails if there are numbers without a leading zero, e.g., .11

vQ