grep txt file names from html
On Oct 31, 2012, at 9:56 AM, chuck.01 wrote:
Sorry, I know I should read up on this first, but I am helping somebody in a hurry and need help myself. I want to grep the names of all the .txt files mentioned on this HTML page: http://www.epa.gov/emap/remap/html/three/data/index.html
The code below identifies the lines in that page's source that contain a double-quoted URL ending in .txt:
lines <- readLines(con=url("http://www.epa.gov/emap/remap/html/three/data/index.html") )
Warning message:
In readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html")) :
incomplete final line found on 'http://www.epa.gov/emap/remap/html/three/data/index.html'
# You can generally ignore that warning.
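If the warning is a nuisance, readLines() has a warn argument that turns it off. A minimal sketch, using a temporary file rather than the EPA page (the file and its contents are made up for illustration):

```r
# Write a small file whose last line has no trailing newline,
# which is what triggers the "incomplete final line" warning.
tmp <- tempfile(fileext = ".html")
cat("<html>\n<body></body>\n</html>", file = tmp)

# warn = FALSE reads the same lines without emitting the warning.
quiet <- readLines(tmp, warn = FALSE)

unlink(tmp)  # remove the temporary file
```

The lines returned are the same either way; only the message is suppressed.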
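# Digits are included in the character class because the paths contain "9396";
# "+" is the standard quantifier for "one or more".
length(grep('\\"http://[./A-Za-z0-9]+\\.txt"', lines) )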
[1] 11
Should be fairly straightforward to remove the preceding and trailing material.
# Keep only the second capture group (the URL itself) from each matching line.
sub('(^.*\\")(http://[./A-Za-z0-9]+\\.txt)(".*$)', "\\2", lines[ grep('\\"http://[./A-Za-z0-9]+\\.txt"', lines) ] )
 [1] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt"
 [2] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/bencnt.txt"
 [3] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt"
 [4] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/habbest.txt"
 [5] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/design/sdesign.txt"
 [6] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
 [7] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshmet.txt"
 [8] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt"
 [9] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshnam.txt"
[10] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftmet.txt"
[11] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftorg.txt"
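Since the question asked for the file names rather than the full URLs, one possible variation is to pull the URLs out with regmatches() and then take basename(). This is only a sketch: the html vector below is a hypothetical stand-in for a few lines of page source (its link text is invented), and the pattern is an alternative to the one used above, not the same expression.

```r
# Hypothetical stand-in for a few lines of the page source (nothing is fetched).
html <- c(
  '<li><a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt">Benthic metrics</a></li>',
  '<p>No link on this line.</p>',
  '<li><a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt">Fish counts</a> <a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshnam.txt">Fish names</a></li>'
)

# Match every http URL ending in .txt; [^"[:space:]]+ cannot cross a quote
# or whitespace, so two URLs on one line are matched separately.
pattern <- 'http://[^"[:space:]]+\\.txt'
txt_urls <- unlist(regmatches(html, gregexpr(pattern, html)))

# basename() drops everything up to the last "/", leaving the file names.
txt_names <- basename(txt_urls)
print(txt_names)
```

Unlike grep() followed by sub(), gregexpr() also picks up a second URL that happens to sit on the same line as another.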
Thanks ahead of time.
David Winsemius, MD Alameda, CA, USA