Find String Between Characters
It looks like you can get the text of the document with
  as(mmm[[1]], "character")
and you can use grep, strsplit, gsub, etc. on that text.

Look at the functions in the XML package for ways to use the XML structure of the data, instead of pattern matching, to extract meaningful parts of the document. See
  class?HTMLInternalDocument

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
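As a rough illustration of the pattern-matching route on plain text, here is a minimal base-R sketch. The string page_text is a made-up stand-in for whatever as(mmm[[1]], "character") returns; the real page text will be much longer and may be laid out differently, so treat the pattern as an assumption to check against the actual document.

```r
# Hypothetical stand-in for the page text. In the thread's setting this
# would be the result of as(mmm[[1]], "character").
page_text <- paste0(
  '<a href="/cgi-bin/browse-edgar?action=getcompany',
  '&amp;CIK=0000320193&amp;owner=exclude&amp;count=40">link</a>'
)

# Locate the first "CIK=" followed by digits, then drop the "CIK=" prefix.
hit <- regmatches(page_text, regexpr("CIK=[0-9]+", page_text))
cik <- sub("CIK=", "", hit, fixed = TRUE)
print(cik)
```

If the pattern is not found, regmatches() returns a zero-length character vector, so cik is empty rather than a copy of the input.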
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sparks, John James
Sent: Saturday, May 14, 2011 7:14 PM
To: jim holtman
Cc: r-help at r-project.org
Subject: Re: [R] Find String Between Characters

Hi Jim,

Thanks for your note. Unfortunately, when I attempt your solution in my exact setting, I get a weird and slightly different answer. First, let me be more clear. What I am attempting to do is pull the CIK number out of the information from the web page itself after it has loaded to R (this may not be optimal, but I am new at this), not from the web page reference (as you have done). So, when I execute the following as per your suggestion:

require(scrapeR)
mmm<-scrape(url="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40")
num <- sub("^.*CIK=([0-9]+).*", "\\1", mmm)
I get
[1] "<pointer: 0x00000000001265c0>"
Is this just a hex representation of the same number, or is something else going on here?
Comments from any and all would be much appreciated.
--John J. Sparks, Ph.D.
On Sat, May 14, 2011 7:57 pm, jim holtman wrote:
Is this what you want:
y&CIK=0000320193&owner=exclude&count=40"
num <- sub("^.*CIK=([0-9]+).*", "\\1", mmm)
num
[1] "0000320193"
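One detail of sub() matters here: when the pattern does not match an element, that element is returned unchanged. It also coerces its input with as.character(), which is probably why a non-character object can come back as pointer text. A small sketch, using two made-up input strings (they are illustrations, not the scraped page):

```r
pattern <- "^.*CIK=([0-9]+).*"

# Two hypothetical inputs: a URL-like string that contains the CIK, and the
# kind of text a pointer object turns into when coerced to character.
inputs <- c(
  "browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40",
  "<pointer: 0x00000000001265c0>"
)

# sub() leaves a non-matching element untouched.
plain <- sub(pattern, "\\1", inputs)

# Guarded version: NA where the pattern does not match.
guarded <- ifelse(grepl(pattern, inputs),
                  sub(pattern, "\\1", inputs),
                  NA_character_)

print(plain)
print(guarded)
```

Checking with grepl() first makes a failed match visible as NA instead of silently handing back the original text.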
On Sat, May 14, 2011 at 8:20 PM, Sparks, John James <jspark4 at uic.edu> wrote:
Dear R Helpers,

I am trying to isolate a set of characters between two other characters in a long string file. I tried some of the examples on the R help pages and elsewhere, but I am not able to get it. Your help would be much appreciated.

require(scrapeR)
mmm<-scrape(url="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40")
str(mmm)

I want to get the number 0000320193 that is between the CIK= and the &.

I have tried

g <- grep( "CIK=|&", mmm )

and

temp<-grep(mmm,\CIK=\&)

and variations on these themes, but they all either won't run or come back as an empty object. How can I grab this number?

Best wishes,
--John J. Sparks, Ph.D.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve?