gregexpr - match overlap mishandled (PR#13391)

Greg Snow · 2008-12-15T16:32:08Z

> -----Original Message-----
> From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
> Sent: Sunday, December 14, 2008 5:39 AM
> To: Greg Snow
> Cc: R help
> Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
>
> Greg Snow wrote:
> > Controlling the pointer is going to be very different from perl since the R functions are vectorized rather than focusing on a single string.
> >
> > Here is one approach that will give all the matches and lengths (for the origi

Yes

The use of regexpr rather than gregexpr, and the '^' added to the beginning of the pattern, were included to prevent duplicate matches: each substring can then only report a match that begins at its first character, so each starting position in the original string is reported at most once.  This works for the original example and all the cases that I can think of.  If there is some way for this strategy to find duplicate matches (without going to extra effort to negate the effect of '^') I would be interested in learning about it.
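As a rough illustration of the strategy described above (this is a sketch, not the code from the earlier message), the idea can be written as follows.  The function name all_matches, the test string, and the pattern are made up for the example.  Wrapping the pattern in a non-capturing group and using perl = TRUE are my assumptions, made so that the '^' applies to the whole pattern even when it contains alternation.

```r
# Hypothetical helper: report every match, including overlapping ones,
# by anchoring the pattern at the start of each suffix of a single string.
all_matches <- function(pattern, x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x), nchar(x) > 0L)
  n <- nchar(x)
  starts <- seq_len(n)
  # One suffix per starting position: x[1..n], x[2..n], ..., x[n..n]
  suffixes <- substring(x, starts, n)
  # '^' forces the match to begin at the first character of each suffix;
  # regexpr (not gregexpr) returns at most one match per suffix.
  anchored <- paste0("^(?:", pattern, ")")
  m <- regexpr(anchored, suffixes, perl = TRUE)
  hit <- m != -1L
  data.frame(start = starts[hit], length = attr(m, "match.length")[hit])
}

res <- all_matches("aba", "ababa")
print(res)  # matches start at positions 1 and 3, each of length 3
```

Because each suffix corresponds to exactly one starting position, a given start can appear only once in the result.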

I claimed it was 'one way', not the best.  In fact I hope that there are better ways, but I expect that methods that are better in one respect may be worse in others.  If storage is an issue, do it in a loop rather than creating all the substrings up front.  Memory and time could also be saved by not creating the substrings that are shorter than the pattern being matched.  I did not do this because I am lazy and figure that the time and memory saved in this example are probably less than the time and effort needed to type in the correction.
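A minimal sketch of the loop variant mentioned above might look like the following.  The name all_matches_loop and the min_len argument are hypothetical: min_len is the caller's statement of the shortest possible match (for a literal pattern, its number of characters), since R does not report that for a general regular expression.  As before, the non-capturing group and perl = TRUE are my assumptions.

```r
# Hypothetical loop version: builds one suffix at a time instead of all
# suffixes up front, and skips suffixes too short to hold a match.
all_matches_loop <- function(pattern, x, min_len = 1L) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x), min_len >= 1L)
  n <- nchar(x)
  anchored <- paste0("^(?:", pattern, ")")
  starts <- integer(0)
  lens <- integer(0)
  # Suffixes starting after last_start have fewer than min_len characters
  last_start <- n - min_len + 1L
  for (i in seq_len(max(last_start, 0L))) {
    m <- regexpr(anchored, substring(x, i, n), perl = TRUE)
    if (m != -1L) {
      starts <- c(starts, i)
      lens <- c(lens, attr(m, "match.length"))
    }
  }
  data.frame(start = starts, length = lens)
}

res_loop <- all_matches_loop("aba", "ababa", min_len = 3L)
print(res_loop)  # starts 1 and 3, each of length 3
```

This trades the up-front vector of substrings for one temporary substring per iteration; whether that is actually faster or smaller in practice would need to be measured on the real data.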

If memory, speed, or other efficiencies are really that big an issue, then it may be better to use a different tool (your perl code for example), but then it moves beyond the scope of this list.

I think the biggest issue with my solution is that it currently only works on single strings and will not scale easily to finding all the matches in a vector of strings.  But without additional information on the real problem, it is hard to know which of memory, time, scalability, etc. is the biggest issue (if an issue at all) to address.
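For what it is worth, the simplest way I can think of to cover a vector of strings is to apply the single-string approach element by element.  The sketch below does that; the name all_matches_vec is made up, the grouping and perl = TRUE are assumptions as before, and nothing here is claimed to be efficient for long vectors or long strings.

```r
# Hypothetical wrapper: one data frame of (start, length) per input string.
all_matches_vec <- function(pattern, xs) {
  stopifnot(is.character(xs))
  anchored <- paste0("^(?:", pattern, ")")
  one <- function(x) {
    n <- nchar(x)
    # Missing or empty strings have no matches to report
    if (is.na(x) || n == 0L) {
      return(data.frame(start = integer(0), length = integer(0)))
    }
    starts <- seq_len(n)
    m <- regexpr(anchored, substring(x, starts, n), perl = TRUE)
    hit <- m != -1L
    data.frame(start = starts[hit], length = attr(m, "match.length")[hit])
  }
  lapply(xs, one)
}

res_vec <- all_matches_vec("aba", c("ababa", "xyz"))
print(res_vec[[1]])  # starts 1 and 3, each of length 3
print(res_vec[[2]])  # no rows
```

The result is a list rather than the integer-vector-with-attributes structure that gregexpr returns, so downstream code written for gregexpr output would need adapting.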

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
