Download CSV Files from EUROSTAT Website
2 messages · Barry Rowlingson, Paul Bivand
This looks as though you need to be a little XML old-school. readHTMLTable is a summary function drawing on:

?htmlTreeParse() turns the table into XML
?xpathApply()
and more.

# xpathApply(doc, "//td", function(x) xmlValue(x)) breaks each line at the end of a table cell and extracts the value
# The "//th" picks out the table headings without distinction as to whether they are rows or columns

This is followed by various gsub() calls and turning the result into a matrix (as it comes out as a list of values without columns).

I couldn't identify the headings, but the table body is definitely doable. readHTMLTable seems to assume that the column headings are a single row, which isn't always the case.

Paul Bivand
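The steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the small inline HTML table is a hypothetical stand-in for the EUROSTAT page (whose real layout is more complex), and the column names are made up. It uses htmlTreeParse() and xpathSApply() from the XML package to pull out the "//th" and "//td" values, strips the flags with gsub(), and reshapes the cell values into a matrix.

```r
library(XML)

# Hypothetical stand-in for the downloaded page: one header row, two data rows.
html <- '<table>
<tr><th>geo</th><th>2011</th><th>2012</th></tr>
<tr><td>Belgium</td><td>1,234 (p)</td><td>1,250</td></tr>
<tr><td>Italy</td><td>2,001</td><td>1,998 (b)</td></tr>
</table>'

# Parse the text into an internal document so that XPath queries can be used.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# "//th" collects every heading cell, "//td" every body cell, in document order.
headings <- xpathSApply(doc, "//th", xmlValue)
cells <- xpathSApply(doc, "//td", xmlValue)

# Remove flags such as "(p)" together with any whitespace before them.
cells <- gsub("\\s*\\(.*\\)", "", cells)

# The cells arrive as one flat vector, so rebuild the rows by hand.
# Assumption: a single heading row, one heading per column.
body <- matrix(cells, ncol = length(headings), byrow = TRUE,
               dimnames = list(NULL, headings))
print(body)
```

With a multi-row heading, as on the real page, the number of columns would have to be worked out some other way, which is the limitation noted above.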
On 5 November 2013 18:44, Barry Rowlingson <b.rowlingson at lancaster.ac.uk> wrote:
On 4 Nov 2013 19:30, "David Winsemius" <dwinsemius at comcast.net> wrote:
Maybe you should use their "download" facility rather than trying to deparse a complex webpage with lots of special user interaction "features":
--
David.

That web page depends on the user already having been to the previous page to set up a session, so directly downloading a dataset requires setting up cookies and making sure the request has all the right parameters. Looks like a right pain.
On Nov 4, 2013, at 11:03 AM, Lorenzo Isella wrote:
Thanks. I had already introduced these minor adjustments in the code, but the real problem (to me) is the information that gets lost: the informative names of the columns, the indicator type and the units.

Cheers

Lorenzo

On Mon, 04 Nov 2013 19:52:51 +0100, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,

If you want to get rid of the (bp) stuff, you can use lapply/gsub. Using Jean's code, slightly changed:
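library(XML)
# Read the raw HTML of the page
mylines <- readLines(url("http://bit.ly/1coCohq"))
closeAllConnections()
# Keep only the second table on the page, with strings left as character
mytable <- readHTMLTable(mylines, which = 2, asText = TRUE,
                         stringsAsFactors = FALSE)
str(mytable)
# Drop flags such as "(bp)", then commas, then convert to numeric
mytable[] <- lapply(mytable, function(x) gsub("\\(.*\\)", "", x))
mytable[] <- lapply(mytable, function(x) gsub(",", "", x))
mytable[] <- lapply(mytable, as.numeric)
# Label the columns by year
colnames(mytable) <- 2000:2013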
Hope this helps,
Rui Barradas
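To make the lapply/gsub cleaning step easier to try without the live page, here is a minimal sketch on a small made-up data frame. The sample values, the helper name clean and the treatment of ":" as a missing-value marker are assumptions for illustration only and are not taken from the thread.

```r
# Hypothetical sample resembling the scraped table: flags and commas mixed in.
mytable <- data.frame(V1 = c("1,234.5(p)", "987.1"),
                      V2 = c("2,000.0", ":"),
                      stringsAsFactors = FALSE)

# Hypothetical helper combining the gsub() steps for one column.
clean <- function(x) {
  x <- gsub("\\([^)]*\\)", "", x)  # drop flags like "(p)" or "(bp)"
  x <- gsub(",", "", x)            # drop commas
  x <- trimws(x)
  x[x == ":"] <- NA                # assumption: ":" marks a missing value
  as.numeric(x)
}

# mytable[] keeps the data frame structure while replacing every column.
mytable[] <- lapply(mytable, clean)
str(mytable)
```

Matching "[^)]*" rather than ".*" inside the parentheses keeps the pattern from swallowing text between two separate flags in the same cell.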
On 04-11-2013 09:53, Lorenzo Isella wrote:
Hello,

And thanks a lot. This is indeed very close to what I need. I am trying to figure out how not to "lose" the headers and how to avoid downloading labels like "(p)" together with the numerical data I am interested in. If anyone on the list knows how to make these minor modifications, s/he will make my life much easier.

Cheers

Lorenzo

On Fri, 01 Nov 2013 14:25:49 +0100, Adams, Jean <jvadams at usgs.gov> wrote:
Lorenzo,

I may be able to help you get started. You can use the XML package to grab the information off the internet.
library(XML)
mylines <- readLines(url("http://bit.ly/1coCohq"))
closeAllConnections()
# Parse every table on the page into a list of data frames
mylist <- readHTMLTable(mylines, asText = TRUE)
# Pick out the table named "xTable"
mytable <- mylist$xTable
However, when I look at the resulting object, mytable, it doesn't have informative row or column headings. Perhaps someone else can figure out how to get that information.

Jean

On Thu, Oct 31, 2013 at 10:38 AM, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
Dear All,

I often need to do some work on data that is publicly available on the EUROSTAT website. I have seen several ways to automatically download, mainly, the bulk data from EUROSTAT to postprocess later with R, for instance

http://bit.ly/HrDICj
http://bit.ly/HrDL10
http://bit.ly/HrDTgT

However, what I would like to do is download directly the csv file corresponding to a properly formatted dataset (typically a dynamic dataset) from EUROSTAT. To fix ideas, please consider the dataset at the following link

http://bit.ly/1coCohq

What I would like to do is automatically read its content into R, or at least automatically download it as a csv file (full extraction, single file, no flags and footnotes) which I can then manipulate easily.

Any suggestion is appreciated.

Cheers

Lorenzo
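For the fallback case mentioned above, where a csv file has already been obtained, reading it while keeping the column headers might look like the sketch below. The inline text stands in for the downloaded file; its layout, the GEO column name and the use of ":" for missing values are assumptions, not something confirmed in this thread.

```r
# Hypothetical contents of a downloaded csv file (layout is an assumption).
csv_text <- 'GEO,2011,2012
Belgium,"1,234.5",:
Italy,"2,001.0","1,998.2"'

# check.names = FALSE keeps headers such as "2011" exactly as written;
# na.strings = ":" turns the assumed missing-value marker into NA.
dat <- read.csv(text = csv_text, check.names = FALSE, na.strings = ":",
                stringsAsFactors = FALSE)

# Convert every column except the label column to numeric, dropping commas.
year_cols <- setdiff(names(dat), "GEO")
dat[year_cols] <- lapply(dat[year_cols],
                         function(x) as.numeric(gsub(",", "", x)))
str(dat)
```

For a real file, the text argument would be replaced by the path of the downloaded csv.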
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA