How to import HTML and SQL files

5 messages · Arup, Warren Young, Dieter Menne +1 more

#
I can't import any HTML or SQL files into R. Please suggest packages for
these two file types, and also let me know the syntax for importing
these two types of files. Thank you in advance.

Arup
#
Arup wrote:
Yeah, I'm confused, too.

What exactly is it you're trying to do?  Not the technical task you 
asked about, but the effect you're trying to achieve?  Can you give 
details about the exact nature of your data sources, or, better, examples?

I ask because actually importing HTML and SQL files is almost certainly 
the wrong approach.  You almost never want to handle texts in either 
language directly in R.

For SQL, you usually don't have "SQL files": files literally containing 
SQL queries.  Or if you do happen to have SQL query files, you probably 
don't want to parse them with R.  I expect what you really want is to be 
able to query a database using SQL.  For that, look up DBI on CRAN. 
This will let you connect R to a database server, and use SQL to get 
data from it in a format that R can process directly.
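
As a rough illustration of the DBI approach, here is a minimal sketch. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real server; the table name and its contents are hypothetical.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for your real database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table, created here only so the query has something to return.
dbWriteTable(con, "mybase",
             data.frame(id = 1:3, price = c(10.5, 11.0, 9.75)))

# Send SQL to the database; the result comes back as an ordinary data frame.
result <- dbGetQuery(con, "SELECT id, price FROM mybase WHERE price > 10")
print(result)

# Always release the connection when done.
dbDisconnect(con)
```

For a different database, only the dbConnect() call should need to change, since each DBI backend supplies its own driver object.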

For HTML, the problem is that HTML is a very difficult language to parse 
correctly in the general case.  Much of the reason for that is that few 
web pages are actually legal HTML, but browsers will quietly cope with 
many classes of errors.  To parse such stuff in R, it's usually best to 
take a case-by-case approach, matching particular structures within the 
file so you can extract the few bits of data you want.  You might want 
to post a snippet of the HTML here to get suggestions.

If you really do have to be able to accept arbitrary HTML, I'd suggest 
running the HTML through a filter that converts it to XHTML, then use 
the XML package from CRAN to load it up into R.

You might also want to look into the RCurl package, if the HTML lives on 
a web server.  You can download it directly instead of saving it out to 
an HTML file.  Then you can use the methods above to process it.
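
A sketch of the download step, assuming the RCurl package is installed. The helper name fetch_page and the URL in the usage comment are hypothetical.

```r
# Hypothetical helper: fetch a page's source as a single character string.
# Assumes the RCurl package is installed.
fetch_page <- function(url) {
  RCurl::getURL(url)
}

# Usage (not run here; replace the placeholder URL with the page you need):
# html  <- fetch_page("http://www.example.com/")
# lines <- strsplit(html, "\n", fixed = TRUE)[[1]]  # one element per line
```

Splitting into lines is optional; it just puts the text in the same shape that readLines() would give for a saved file.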
#
Arup <arup.pramanik27 <at> gmail.com> writes:
Also confused. HTML and SQL are like apples and bugs.

For HTML (assume you want to extract stock quotes from a site)

-- If you have strict XHTML, using package XML might be
   the best choice, but I doubt you get these nowadays.
-- Otherwise, read in the file and use regular expressions (grep, 
   gsub) to parse.
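
A minimal sketch of the regular-expression route, using only base R. The HTML fragment, class names, and ticker symbols are made up for illustration; with a real file you would start from readLines() instead.

```r
# Hypothetical page fragment; in practice: html <- readLines("quotes.html")
html <- c(
  "<html><body>",
  "<table>",
  "<tr><td class=\"sym\">ABC</td><td class=\"px\">12.34</td></tr>",
  "<tr><td class=\"sym\">XYZ</td><td class=\"px\">56.78</td></tr>",
  "</table>",
  "</body></html>"
)

# grep() keeps only the lines that contain a price cell.
rows <- grep("class=\"px\"", html, value = TRUE)

# gsub() with a capture group replaces each row by the one field we want.
symbol <- gsub(".*class=\"sym\">([^<]*)<.*", "\\1", rows)
price  <- as.numeric(gsub(".*class=\"px\">([^<]*)<.*", "\\1", rows))

quotes <- data.frame(symbol = symbol, price = price)
print(quotes)
```

The patterns are tied to this particular layout, so they will need adjusting whenever the page structure changes.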

For SQL: SELECT * from mybase

-- "Importing" that string does not help very much; this is 
   a program telling you what to do when you know your database.
-- You might have a look at package RODBC or RSQLite; details depend on 
   the database you are going to use.

Dieter
#
Dieter Menne wrote:
The htmlParse() and htmlTreeParse() functions in the XML package
use the non-strict HTML parser in libxml2 and so the HTML document
can be malformed.  That parser tends to be quite tolerant so that
you get an HTML tree back, even if the ambiguities in the original
HTML document lead to a tree that one might not expect.

I've not had any troubles parsing HTML files with it.
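
A small sketch of that, assuming the XML package is installed. The fragment is invented and deliberately leaves out several closing tags; how the parser repairs it determines exactly which cells come back.

```r
library(XML)

# Deliberately malformed, made-up fragment: <p>, <td> and <tr> are never closed.
bad_html <- paste0(
  "<html><body><p>Quotes",
  "<table><tr><td>ABC<td>12.34<tr><td>XYZ<td>56.78</table>",
  "</body></html>"
)

# asText = TRUE tells htmlParse() the argument is HTML content, not a file name.
doc <- htmlParse(bad_html, asText = TRUE)

# Pull the text of every table cell out of the repaired tree.
cells <- xpathSApply(doc, "//td", xmlValue)
print(cells)
```

Printing doc itself shows the tree the parser built, which is a quick way to check whether it resolved the ambiguities the way you expected.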

D.
#
Thanks a lot. I am trying to pick up R on my own. I will ask you
questions if I run into a problem at any point. Thank you.

Arup