Reading selected lines in an .html file

5 messages · v.demart@libero.it, Uwe Ligges, Nutter, Benjamin +2 more

v.demart@libero.it

Wed, Jun 4, 2008 12:49 PM #

Dear friend, 

In an R program running permanently on a server I would like to read, hour by 
hour, the temperature in °C and the humidity from a site like this (actually, 
from many such sites):

http://www.wunderground.com/global/stations/16239.html

How can I read the content of the site and select the info I need?

Ciao
Vittorio

Uwe Ligges

Thu, Jun 5, 2008 5:01 AM #

Of course you could use readLines() and post process by using 
appropriate functions such as grep(), strsplit() and friends later on.
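
A minimal sketch of that approach in base R. The page fragment is invented for illustration (the real page's markup will differ), textConnection() stands in for the url() connection one would use on the live page, and extract_number() is a hypothetical helper name.

```r
# Hypothetical page fragment; on the live page one would instead call
# readLines(url("http://www.wunderground.com/global/stations/16239.html"))
page <- c(
  "<html><body>",
  "<td>Temperature</td><td><b>17.0</b> &deg;C</td>",
  "<td>Humidity</td><td><b>63</b>%</td>",
  "</body></html>"
)
con <- textConnection(page)
lines <- readLines(con)
close(con)

# Hypothetical helper: find the line mentioning `label`, split it at the
# tag delimiters, and keep the first piece that looks like a number
extract_number <- function(lines, label) {
  hit <- grep(label, lines, value = TRUE, fixed = TRUE)[1]
  pieces <- strsplit(hit, "[<>]")[[1]]
  as.numeric(grep("^-?[0-9.]+$", pieces, value = TRUE)[1])
}

temperature <- extract_number(lines, "Temperature")
humidity <- extract_number(lines, "Humidity")
```

This only works while the label and the number stay on the same line, so it needs checking against the actual page source.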

Uwe Ligges

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Nutter, Benjamin

Thu, Jun 5, 2008 1:45 PM #

I've tried to tackle a similar question at the request of a coworker.
Unfortunately, it is difficult to read in HTML code because it lacks a
character that can consistently be used as a delimiter.  The only
guideline I can offer is that any text you're interested in is going to
be between a ">" and a "<".  So the goal is to eliminate anything
between a "<" and a ">".
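
As a rough illustration of eliminating everything between "<" and ">", here is a base R sketch using gsub(). The HTML string is made up for the example.

```r
# Invented one-row fragment, for illustration only
html <- "<tr><td class=\"label\">Humidity</td><td><b>63</b>%</td></tr>"

# Replace every <...> run with a space so that neighbouring cells do not fuse
text <- gsub("<[^>]*>", " ", html)

# Collapse runs of whitespace and trim both ends
text <- trimws(gsub("[[:space:]]+", " ", text))
print(text)  # "Humidity 63 %"
```

The pattern assumes no ">" occurs inside an attribute value, a comment or a script block; pages that contain those need a real parser.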

What's more, if you really want to read in HTML code, you'll need a good
grasp of HTML itself, and some familiarity with how the code you're
reading in is structured.  For instance, I'm attaching code that I wrote
to read in HTML tables that were generated by other functions commonly
used in my workplace.  But my code assumes that the tables are written
by row (using the <tr> tag).
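
The attached code is not reproduced here. The sketch below is not that code, only a base R illustration of the same row-by-row idea, run on an invented table whose column names and values are assumptions.

```r
# Invented table, written row by row with <tr> and cell by cell with <td>
html <- paste0(
  "<table>",
  "<tr><td>Station</td><td>Temp</td><td>Humidity</td></tr>",
  "<tr><td>LIRA</td><td>17.0</td><td>63</td></tr>",
  "<tr><td>LIRF</td><td>18.5</td><td>58</td></tr>",
  "</table>"
)

# Split the markup at each opening <tr> tag, then drop the piece that
# precedes the first row (it contains no cells)
rows <- strsplit(html, "<tr[^>]*>")[[1]]
rows <- rows[grepl("<td", rows, fixed = TRUE)]

# Within each row, split at every tag and keep the non-empty text pieces
cells <- lapply(rows, function(r) {
  parts <- strsplit(r, "<[^>]*>")[[1]]
  parts[nzchar(parts)]
})

# First row holds the column names, the rest hold the data
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]
tab$Temp <- as.numeric(tab$Temp)
tab$Humidity <- as.numeric(tab$Humidity)
```

Empty cells are dropped by the nzchar() filter, which would misalign columns, so this only suits tables known to be fully populated.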

Essentially, after studying the code I was going to read in, I
hand-picked the markers that I could use to isolate the text I wanted.  I
then proceeded to play a game of Simon Says to break down the code into
smaller and smaller pieces until I got what I wanted.

Unless you're going to be doing this a lot, I wouldn't recommend taking
the time to try and write a function like this.  In most cases it's
probably faster just to copy the data by hand.  But if you are
determined to make it work, I hope the ideas help.

Benjamin

Daniel Folkinshteyn

Thu, Jun 5, 2008 1:57 PM #

i know this is an R mailing list :) but... i'll recommend you try python 
with the beautifulsoup module - makes html processing a cinch.

another thing to note is that wunderground provides very handy RSS feeds 
for every location, so rather than parsing the html page (with its 
associated bundles of gunk), you'd have a better time parsing the RSS 
feed. (there are some rss parsing libraries for python, too, but in your 
simple case it may be simpler to just extract stuff manually with some 
well-placed regexps)
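
for those who want to stay in R, the same "well-placed regexps" idea can be sketched with regexpr() and regmatches(). the feed item below is invented; the real feed's layout is an assumption and has to be checked first. grab() is a hypothetical helper name.

```r
# Invented RSS item; the real feed's wording and units may differ
rss <- c(
  "<item>",
  "<title>Current Conditions</title>",
  "<description>Temperature: 17.0 C | Humidity: 63% | Wind: 5 km/h</description>",
  "</item>"
)
desc <- grep("<description>", rss, value = TRUE, fixed = TRUE)

# Hypothetical helper: regexpr() locates the first match, regmatches()
# returns the matched text, and sub() keeps only the captured number
grab <- function(pattern, x) {
  m <- regmatches(x, regexpr(pattern, x))
  as.numeric(sub(pattern, "\\1", m))
}

temperature <- grab("Temperature: *(-?[0-9.]+)", desc)
humidity <- grab("Humidity: *([0-9.]+)", desc)
```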

so use python to pull that out, and append to a nice tab-delimited file, 
and then in your R process just read from that file.
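
the hand-off through a tab-delimited file might look like the sketch below. here the writing side is done in R too, standing in for whatever process collects the readings; the file location, column names and append_reading() are all assumptions made up for the example.

```r
log_file <- tempfile(fileext = ".tsv")  # hypothetical location

# Hypothetical helper: append one reading, writing the header only when
# the file does not exist yet
append_reading <- function(file, time, temperature, humidity) {
  row <- data.frame(time = time, temperature = temperature, humidity = humidity)
  new_file <- !file.exists(file)
  write.table(row, file, sep = "\t", append = !new_file,
              col.names = new_file, row.names = FALSE, quote = FALSE)
}

append_reading(log_file, "2008-06-05 13:00", 17.0, 63)
append_reading(log_file, "2008-06-05 14:00", 18.5, 58)

# The long-running R process just re-reads the file when it needs the data
readings <- read.delim(log_file, stringsAsFactors = FALSE)
```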

Martin Morgan

Thu, Jun 5, 2008 2:07 PM #

Staying in R, the XML package in conjunction with the XPath query
language is likely to be your friend.

+    @pwsid='LIRA']/@value", xmlValue)
[[1]]
[1] "63"

see http://www.w3.org/TR/xpath especially
http://www.w3.org/TR/xpath#path-abbrev for xpath hints.
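
For readers who want something self-contained to try, here is a small sketch of the XML-plus-XPath approach on an invented fragment. Only the pwsid and value attribute names come from the query shown above; the span element, the second station and the numbers are assumptions, and the real page must be inspected to write a working path.

```r
library(XML)  # CRAN package, not part of base R

# Invented fragment; element name and values are assumptions
html <- paste0(
  "<html><body>",
  "<span pwsid='LIRA' value='63'>63</span>",
  "<span pwsid='LIRF' value='58'>58</span>",
  "</body></html>"
)

# Parse the string as HTML into an internal document tree
doc <- htmlParse(html, asText = TRUE)

# Select the span whose pwsid attribute is 'LIRA' and read its value attribute
humidity <- xpathSApply(doc, "//span[@pwsid='LIRA']", xmlGetAttr, "value")
humidity <- as.numeric(humidity)
```

htmlParse() also accepts a file name or URL in place of the string, in which case asText is left at its default.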

Martin

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793