extracting a matched string using regexpr Possible BUG
6 messages · David Winsemius, steven mosher, Simon Urbanek
On May 6, 2010, at 2:21 AM, steven mosher wrote:
> see below,
> using a regex in sub() fails if the pattern is //d{5} and succeeds
> if the pattern [0-9]{5} is used.. see the test cases below.
> issue was not on windows machine and david and I had it on MAC.
Except we both were using \\d rather than //d. I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)
David.
> sessionInfo()
R version 2.10.1 RC (2009-12-09 r50695)
x86_64-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets
methods base
other attached packages:
[1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3
lattice_0.18-3
loaded via a namespace (and not attached):
[1] chron_2.3-35 grid_2.10.1 tools_2.10.1
>
> r11
>
> mac os 10.5
>
> ---------- Forwarded message ----------
> From: steven mosher <moshersteven at gmail.com>
> Date: Wed, May 5, 2010 at 3:25 PM
> Subject: Re: [R] extracting a matched string using regexpr
> To: David Winsemius <dwinsemius at comcast.net>
> Cc: Gabor Grothendieck <ggrothendieck at gmail.com>, r-help <
> r-help at r-project.org>
>
>
> with a fresh restart
>
>
>
> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> test
> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> sub(".*(\\d{5}).*", "\\1", test)
> [1] "</th>"
>> sub(".*([0-9]{5}).*", "\\1", test)
> [1] "88958"
>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "WWWWW"
>>
>> sub(".*(\\d{5}).*", "\\1", test2)
> [1] "WWWWW"
>> sub(".*([0-9]{5}).*", "\\1", test2)
> [1] "12345"
>
>
> Steve.
>
>
>
> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>>
>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>
>>> Here are two ways to extract 5 digits.
>>>
>>> In the first one \\1 refers to the portion matched between the
>>> parentheses in the regular expression.
>>>
>>> In the second one strapply is like apply where the object to be worked
>>> on is the first argument (array for apply, string for strapply), the
>>> second modifies it (which dimension for apply, regular expression for
>>> strapply) and the last is a function which acts on each value
>>> (typically each row or column for apply and each match for strapply).
>>> In this case we use c as our function to just return all the results.
>>> They are returned in a list with one component per string, but here
>>> test is just a single string so we get a list one long and we ask for
>>> the contents of the first component using [[1]].
>>>
>>> # 1 - sub
>>> sub(".*(\\d{5}).*", "\\1", test)
>>>
>>> test
>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>
>> I get different results than I expected given that "\\d" should be
>> synonymous with "[0-9]":
>>
>>
>>> sub(".*([0-9]{5}).*", "\\1", test)
>> [1] "88958"
>>
>>> sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</th>"
>>
>> --
>> David.
>>
>>>
>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>> library(gsubfn)
>>> strapply(test, "\\d{5}", c)[[1]]
>>>
>>>
>>>
>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <moshersteven at gmail.com> wrote:
>>>
>>>> Given a text like
>>>>
>>>> I want to be able to extract a matched regular expression from a
>>>> piece of
>>>> text.
>>>>
>>>> this apparently works, but is pretty ugly
>>>> # some html
>>>>
>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>> # a pattern to extract 5 digits
>>>>
>>>>> pattern<-"[0-9]{5}"
>>>>>
>>>> # regexpr returns a start point [1] and an attribute "match.length":
>>>> # attr(,"match.length")
>>>> # get the substring from the start point to the stop point,
>>>> # where stop = start + length - 1
>>>>
>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>
>>>>> answer
>>>>>
>>>> [1] "88958"
>>>>
>>>> I tried using sub(pattern, replacement, x) with a regexp that
>>>> captured the group. I'd found an example of this in the mails
>>>> but it didn't seem to work.
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-SIG-Mac mailing list
> R-SIG-Mac at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
David Winsemius, MD
West Hartford, CT
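The substr()/regexpr() extraction quoted above calls regexpr() three times. A minimal sketch of the same idea using regmatches() follows. It assumes an R release newer than the 2.10/2.11 versions discussed in this thread, since regmatches() was added to base R later. The variable names first_match and all_matches are illustrative only.

```r
# Sample HTML fragment and pattern taken from the thread
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

# regexpr() is called once; regmatches() reads the start position and the
# "match.length" attribute from its result and returns the matched text
m <- regexpr(pattern, test)
first_match <- regmatches(test, m)

# gregexpr() locates every match; regmatches() then returns a list with one
# component per input string, so [[1]] picks the matches for 'test'
all_matches <- regmatches(test, gregexpr(pattern, test))[[1]]

print(first_match)
print(all_matches)
```

Because the pattern uses [0-9] rather than \\d, this sketch avoids the escape that misbehaved in the reports above.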
FWIW I don't think \d is a basic regexp, so I would expect the perl mode to work, and it does:
test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"
Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.
Also note that the bug is locale-specific:
LANG=C R
test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
[1] "12345"
sub(".*(\\d{5}).*", "\\1", test2)
[1] "12345"
Also note that this is not Mac-specific:
test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
system("uname -sr")
Linux 2.6.32-trunk-amd64
Sys.getlocale("LC_CTYPE")
[1] "en_US.UTF-8"
Cheers, Simon
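Since the wrong result was reported only for \\d with the default engine in a UTF-8 locale, one way to sidestep it is to print the locale and use a spelling of "five digits" that does not rely on that combination. The sketch below is illustrative only. The [0-9] and perl=TRUE forms are the ones shown working in this thread. The [[:digit:]] form is an assumption added here and was not tested by the posters on the affected versions.

```r
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Show the character-type locale; the misbehaviour was reported under
# en_US.UTF-8 and not under LANG=C
print(Sys.getlocale("LC_CTYPE"))

# Explicit bracket range with the default engine (worked in the thread)
bracket <- sub(".*([0-9]{5}).*", "\\1", test2)

# POSIX character class with the default engine (assumption, untested in the thread)
posix <- sub(".*([[:digit:]]{5}).*", "\\1", test2)

# \\d evaluated by the PCRE engine (worked in the thread)
pcre <- sub(".*(\\d{5}).*", "\\1", test2, perl = TRUE)

print(c(bracket = bracket, posix = posix, pcre = pcre))
```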
On May 6, 2010, at 6:54 AM, David Winsemius wrote:
> [...]
Two Q's:
A) Is this supposed to happen with perl-mode?:
> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>
> sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"
>
> sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"
Looks to me that a period is being improperly recognized.
On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:
> FWIW I don't think \d is a basic regexp
B) With regard to the default (which I read to be extended rather than basic) vs. perl-like, the Extended section of the regex documentation contains: "Symbols \d, \s, \D and \S denote the digit and space classes and their negations."
David Winsemius, MD West Hartford, CT
On May 6, 2010, at 11:50 AM, David Winsemius wrote:
> Two Q's:
> A) Is this supposed to happen with perl-mode?:
> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
> sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
Nope - perl does take EOL into account, so .* will be matched only to the end of the line. For your purposes you want to enable the (?s) option, so you probably meant:
sub("(?s).*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958"
> Looks to me that a period is being improperly recognized.
> On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:
>> FWIW I don't think \d is a basic regexp
> B) With regard to the default (which I read to be extended rather than basic) vs. perl-like, the Extended section of the regex documentation contains: "Symbols \d, \s, \D and \S denote the digit and space classes and their negations."
Yes, you're right - extended is the default.
Cheers, Simon
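The effect of the (?s) option can be seen by running the same substitution with and without it. This is a small sketch based on the strings already posted in this thread; the names no_dotall and dotall are illustrative only.

```r
# Sample string from the thread; note the embedded newline before "W"
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# With perl = TRUE and no (?s), "." does not match "\n", so the match ends at
# the newline and everything from "\n" onward is left in the result
no_dotall <- sub(".*(\\d{5}).*", "\\1", test, perl = TRUE)

# With (?s), "." also matches "\n", so the whole string is consumed and only
# the captured digits remain
dotall <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

print(no_dotall)
print(dotall)
```

The first result keeps the text after the newline and the second is just the five digits, matching the outputs quoted above.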