Retrieve regular expression groups

7 messages · jim holtman, OKB (not okblacke), Gabor Grothendieck

#
I'm trying to figure out how to get the text captured by capturing 
groups out of a regex match.  For instance, let's say I have the pattern 
"foo ([^ ]+)" and I match it against the string "This is a foo sentence 
I am reading."  The group in the pattern will match the word "sentence" 
in the target string.  How can I get access to this matched group?  All 
I can seem to get the various grep/gsub functions to do is return or 
modify the entire target string.  Isn't there a way to extract ONLY the 
text from a particular group or groups?

Thanks,
#
The strapply function in gsubfn does that.  See http://gsubfn.googlecode.com
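
For readers without gsubfn installed, here is a minimal base R sketch of the same idea using regexec() and regmatches().  This is a swapped-in technique, not the strapply approach recommended above, and it assumes a reasonably recent R that provides regmatches().  The variable names are made up for illustration; the pattern and string are the ones from the question.

```r
# Illustrative names; pattern and target come from the original question.
pattern <- "foo ([^ ]+)"
target  <- "This is a foo sentence I am reading."

# regexec() records where the whole match and each group start and end;
# regmatches() turns those positions back into text.
m <- regmatches(target, regexec(pattern, target))[[1]]

whole_match <- m[1]   # text matched by the entire pattern
groups      <- m[-1]  # text captured by each parenthesized group

# Alternative for a single group: replace the whole string with backreference 1.
# Note that sub() returns its input unchanged when the pattern does not match.
first_group <- sub(".*foo ([^ ]+).*", "\\1", target)
```

The regexec() route keeps "no match" distinguishable (it yields a zero-length vector), which the sub() shortcut does not.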


On Sun, May 2, 2010 at 6:03 PM, OKB (not okblacke)
<brenbarn at brenbarn.net> wrote:
#
Gabor Grothendieck wrote:

            
Ah, thanks.  The documentation for that function is pretty 
difficult to grasp, but I think I figured it out. . . almost.  However, 
for some reason I can't seem to make strapply work inside an sapply (to 
do multiple regex searches over the same data).  For instance, take a 
look at this toy setup.
> pats
[1] "([^ ]+) .i. ([^ ]+)" "([^ ]+) ..g ([^ ]+)"
> tmp
[1] "this is a big test"   "this is a pig test"   "this is a lim test"   "this is a non test"   "this is a big foolio"
[6] "this is a wig foolio" "this is a fog test"   "this is a bog test"

With these data, strapply(tmp, pats[1], c) works as expected, as 
does strapply(tmp, pats[2], c).  However, this doesn't work:

sapply(pats, strapply, X=tmp, FUN=c)

Instead it returns a strange table, some of whose elements contain the 
code of strapply itself.  Also, the above code gives different results 
depending on whether I specify "X=tmp" or simply "tmp" as the third 
argument.  Shouldn't these be the same, since X is the first argument of 
strapply?  Any idea what's going on here?

Thanks again,
#
There are quite a few examples in

(1) ?strapply,
(2) the home page,
(3) the vignette and
(4) r-help back posts

if you are having problems understanding the textual description.

Note that X and FUN are also arguments to sapply:

> args(sapply)
function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
NULL

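so in the sapply construct in your post, X=tmp and FUN=c are matched to
sapply's own arguments, and pats and strapply are passed on to c as
extra arguments.  The effect is to apply c to tmp, pats and strapply,
so the output you observe is correct.  The sapply command never even
calls strapply.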
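
The argument matching can be reproduced in base R alone.  The sketch below uses extract_groups, a hypothetical stand-in written for this illustration (it is not part of gsubfn) that merely shares strapply's leading argument names X, pattern and FUN.  The data are the pats and tmp vectors from the earlier post.

```r
pats <- c("([^ ]+) .i. ([^ ]+)", "([^ ]+) ..g ([^ ]+)")
tmp  <- c("this is a big test", "this is a pig test", "this is a lim test",
          "this is a non test", "this is a big foolio", "this is a wig foolio",
          "this is a fog test", "this is a bog test")

# Hypothetical stand-in for strapply (base R only): apply FUN to the
# captured groups of each string, or return NULL where nothing matches.
extract_groups <- function(X, pattern, FUN) {
  hits <- regmatches(X, regexec(pattern, X))
  lapply(hits, function(g) if (length(g) > 0) FUN(g[-1]) else NULL)
}

# X= and FUN= are claimed by sapply itself, so this really runs
# c(<one element of tmp>, pats, extract_groups) for each element of tmp.
# extract_groups is never called; it is just data handed to c().
wrong <- sapply(pats, extract_groups, X = tmp, FUN = c)

# Wrapping the call in a function with a differently named argument
# avoids the clash: each pattern is now searched for in all of tmp.
right <- lapply(pats, function(pat) extract_groups(tmp, pat, c))
```

Here wrong is a list-matrix whose cells include the function object itself, which is why printing it shows function code, while right holds one list of results per pattern.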

On Sun, May 2, 2010 at 8:20 PM, OKB (not okblacke)
<brenbarn at brenbarn.net> wrote:
3 days later
#
Gabor Grothendieck wrote:

            
Ah, I see.  I take it this means it is not possible to use sapply 
with strapply directly.  (I'd have to write a function that wraps 
strapply with different argument names, and then sapply that.)  Anyway, 
I managed to do what I wanted by collapsing my multiple regexes into 
one, so it's good for now.  Thanks for the help.
#
Yes, you have to wrap it in a function but note that gsubfn does have
facilities to make wrapping something in a function easier.  If a
function call is preceded by fn$ then a function in the arguments can
be specified using a formula notation.

For example, first we define NAify, which forms a list of its arguments,
replaces NULLs with NAs and then reshapes the result into a matrix.
Instead of writing function(el) if (is.null(el)) NA else el we wrote
the indicated formula, whose right hand side is the function body and
whose arguments are the free variables in the right hand side, in this
case just el.  Later we use formula notation to specify
function(pat) strapply(tmp, pat, c, simplify = NAify) more compactly.
> NAify <- function(...) {
+ do.call(rbind, fn$lapply(list(...), ~ if (is.null(el)) NA else el))
+ }
     [,1] [,2]     [,3] [,4]
[1,] "a"  "test"   "a"  "test"
[2,] "a"  "test"   "a"  "test"
[3,] "a"  "test"   NA   NA
[4,] NA   NA       NA   NA
[5,] "a"  "foolio" "a"  "foolio"
[6,] "a"  "foolio" "a"  "foolio"
[7,] NA   NA       "a"  "test"
[8,] NA   NA       "a"  "test"
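
For comparison, the NULL-to-NA step can also be written without the fn$ formula shorthand.  This is only a sketch of the same idea with an ordinary anonymous function; the name NAify_base and the sample inputs are made up for illustration.

```r
# Base R version of the idea behind NAify: collect the arguments in a list,
# replace each NULL with NA, then stack everything row-wise into a matrix.
NAify_base <- function(...) {
  rows <- lapply(list(...), function(el) if (is.null(el)) NA else el)
  do.call(rbind, rows)
}

# Illustrative input: two matches with two groups each, and one non-match.
res <- NAify_base(c("a", "test"), NULL, c("a", "foolio"))
```

rbind recycles the single NA across the row, so a non-matching string shows up as a full row of NAs, as in the output above.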


On Wed, May 5, 2010 at 11:03 PM, OKB (not okblacke)
<brenbarn at brenbarn.net> wrote: