
Scrap JavaScript and styles from an HTML document

----------------------------------------
Comments like this come up on the list every few weeks, and I keep
suggesting that someone (other than me, of course, LOL) investigate an R
interface to WebKit for any effort that needs to mimic large parts of a
browser's behavior. Perhaps just make a debug or custom build of WebKit
that dumps whatever you want into a structured text file. (I've actually
done this for what amounted to a crawler: I modified maybe one or two
classes to print the links being fetched to stdout, and I think there are
ways to dump a DOM or other data in a format usable by R.)

For valid pages, you can just parse the HTML as XML and get what you want
in this case, but usually people are looking for information that only
appears after large pieces of JS have executed. If you only want comments,
these may be easy to isolate yourself. If you google "CRAN HTML parser",
some hits do come up, for example:

http://cran.r-project.org/web/packages/scrapeR/scrapeR.pdf

http://r.789695.n4.nabble.com/How-to-import-HTML-and-SQL-files-td879480.html
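
For the simple case above, where the page does not need any JS executed,
the parse-and-drop idea can be sketched roughly as follows. This is only
an illustration, not something from the packages linked above: it uses
Python's standard-library html.parser rather than R, the class and
function names (ScriptStyleStripper, strip_scripts_and_styles) are made
up for the example, and it assumes "comments" means HTML comment nodes.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Hypothetical helper: keep visible text and HTML comments,
    skip everything inside <script> and <style> elements."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a <script> or <style> element
        self.text = []        # visible text fragments, in document order
        self.comments = []    # contents of <!-- ... --> nodes

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Drop script/style bodies and whitespace-only fragments.
        if self.skip_depth == 0 and data.strip():
            self.text.append(data.strip())

    def handle_comment(self, data):
        self.comments.append(data.strip())


def strip_scripts_and_styles(html):
    """Return (text_fragments, comments) for an HTML string."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return parser.text, parser.comments


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { color: red; }</style>"
        "<script>var x = 1;</script></head>"
        "<body><!-- a comment --><p>Hello <b>world</b></p></body></html>"
    )
    text, comments = strip_scripts_and_styles(sample)
    print(text)
    print(comments)
```

Note that this only discards the script and style text. Anything a
script would have generated at run time is not recovered, which is
exactly the case where a real browser engine is needed.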