Scrap java scripts and styles from an html document

----------------------------------------
Date: Thu, 7 Apr 2011 04:15:50 -0700
From: antujsrv at gmail.com
To: r-help at r-project.org
Subject: Re: [R] Scrap java scripts and styles from an html document

Hi,

I am working on developing a web crawler.
Comments like this come up on the list every few weeks or so, and I
keep suggesting that someone (other than me, of course, LOL) investigate
an R interface to WebKit for any effort that requires mimicking
large parts of a browser's functionality. Perhaps just make
a debug build or custom build of WebKit to dump whatever it is you want
into a structured text file.
(I've actually done this for what would amount to a crawler: I modified
maybe one or two classes to output the links being fetched to stdout, but
I think there are ways to dump a DOM or other stuff in a format usable by R.)
For valid pages, you can just parse the HTML as XML and get what you want in this
case, but usually people are looking for information only apparent after
large pieces of JS are executed. If you want comments only, these
may be easy to isolate yourself. If you google "CRAN HTML parser", some
hits do come up, for example:

http://cran.r-project.org/web/packages/scrapeR/scrapeR.pdf

http://r.789695.n4.nabble.com/How-to-import-HTML-and-SQL-files-td879480.html
Removing JavaScript and styles is part of cleaning the HTML document.
What I want is a cleaned HTML document with only the HTML tags and textual
information, so that I can figure out the pattern of the web page. This is
being done to extract relevant information from the web page, such as
comments for a particular product.
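
The cleaning step described above could be sketched roughly as follows. This
is only an illustration, written in Python with the standard library rather
than R, and the names Cleaner and strip_scripts_and_styles are hypothetical.
It assumes that "cleaning" means dropping script and style elements together
with their contents while keeping every other tag and the text. HTML comments
and the doctype are also dropped, because the sketch does not re-emit them.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only these two elements are removed together with their content.
SKIPPED = {"script", "style"}


class Cleaner(HTMLParser):
    """Re-emit an HTML document without script/style elements."""

    def __init__(self):
        super().__init__()      # character references are decoded by default
        self.out = []           # pieces of the cleaned document
        self.skipping = False   # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skipping = True
        elif not self.skipping:
            # Keep the start tag exactly as it appeared in the source.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        if tag not in SKIPPED and not self.skipping:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skipping = False
        elif not self.skipping:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            # Text was decoded by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def strip_scripts_and_styles(html_text):
    """Return html_text with script and style elements removed."""
    parser = Cleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling strip_scripts_and_styles() on a page's source returns a string that
still contains the markup and text but no script or style blocks. It does not
execute any JavaScript, so content that only appears after scripts run will
not be present.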

For example, amazon.com has all such comments within the [...] and [...] tags,
with regular [...] occurring for breaks. So the tags which appear the most
help us in locating the required information. Different websites have
different patterns, but it's more likely that the tags that occur the most
will have the relevant information enclosed in them.

So, once the HTML page is cleaned, it would be easy to roll up the tags and,
knowing their frequency of occurrence, target the information.
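
Rolling up the tags by frequency might look something like the sketch below.
Again this is Python with the standard library rather than R, and the names
TagCounter and tag_frequencies are hypothetical. It simply counts every
opening tag by name; deciding which of the frequent tags actually hold the
wanted information is left to the reader.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name is opened in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; self-closing tags such as <br/>
        # also pass through here via the default handle_startendtag.
        self.counts[tag] += 1


def tag_frequencies(html_text):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts
```

For instance, tag_frequencies(page).most_common(5) would list the five most
frequent tag names in the string page, most frequent first.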

Should there be any suggestions to help, please let me know. I would be more
than pleased.

Regards,
Antuj

--
View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3433052.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
