web scraping tables generated in multiple server pages
Unfortunately, it's a wretched, vile, SharePoint-based site. That
means it doesn't use traditional URL-based pagination, so one of the
only ways to do this effectively is going to be to drive the page with
RSelenium:
library(RSelenium)
library(rvest)
library(dplyr)
library(pbapply)

URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"

checkForServer()
startServer()

remDr <- remoteDriver$new()
remDr$open()
remDr$navigate(URL)

pblapply(1:69, function(i) {

  if (i %in% seq(1, 69, 10)) {

    # the first item in each group of 10 page links is not a link (it's
    # the page currently displayed), so we can just grab the page source
    pg <- read_html(remDr$getPageSource()[[1]])
    ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  } else {

    # we can get to the rest of them by clicking the link text directly
    ref <- remDr$findElements("xpath",
             sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
    ref[[1]]$clickElement()
    pg <- read_html(remDr$getPageSource()[[1]])
    ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)

  }

  # we have to move to the next actual page of links after every 10 links
  if ((i %% 10) == 0) {
    ref <- remDr$findElements("xpath", ".//a[.='...']")
    ref[[length(ref)]]$clickElement()
  }

  ret

}) -> tabs

final_dat <- bind_rows(tabs)
final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs

remDr$quit()
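One follow-up, since (as you noted) the numeric fields use "." as the
thousands separator and "," as the decimal separator and therefore come
back as character: you'll probably want a small converter along these
lines before doing any math on them (to_num is just an illustrative
helper name, and the column indices are the ones from your subset):

```r
# sketch: convert Brazilian-formatted number strings (e.g. "1.551,14")
# to numeric (1551.14)
to_num <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # strip thousands separators
  x <- gsub(",", ".", x, fixed = TRUE) # comma -> decimal point
  as.numeric(x)
}

to_num("1.551,14")
```

Then something like `final_dat[, 5:7] <- lapply(final_dat[, 5:7], to_num)`
(adjust the indices to whichever columns are numeric in your subset).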
Probably good reference code to have around, but you can grab the data & code
here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
(anything to help a fellow parent out :-)
-Bob
On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <friendly at yorku.ca> wrote:

This is my first attempt to try R web scraping tools, for a project my
daughter is working on. It concerns a database of projects in Sao Paulo,
Brazil, listed at
http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
but spread out over 69 pages accessed through a javascript menu at the
bottom of the page.

Each web page contains 3 HTML tables, of which only the last contains the
relevant data. In this, only a subset of columns are of interest. I tried
using the XML package as illustrated on several tutorial pages, as shown
below. I have no idea how to automate this to extract these tables from
multiple web pages. Is there some other package better suited to this
task? Can someone help me solve this and other issues?

# Goal: read the data tables contained on 69 pages generated by the link below, where
# each page is generated by a javascript link in the menu at the bottom of the page.
#
# Each "page" contains 3 html tables, with names "Table 1", "Table 2", and the only one
# of interest with the data, "grdRelSitGeralProcessos"
#
# From each such table, extract the following columns:
#   - Processo
#   - Endereço
#   - Distrito
#   - Area terreno (m2)
#   - Valor contrapartida ($)
#   - Area excedente (m2)
#
# NB: All of the numeric fields use "." as the thousands separator and "," as the
# decimal separator, but because of this are read in as character

library(XML)

link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"

saopaulo <- htmlParse(link)
saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
length(saopaulo.tables)

# it's the third table on this page we want
sp.tab <- saopaulo.tables[[3]]

# columns wanted
wanted <- c(1, 2, 5, 7, 8, 13, 14)
head(sp.tab[, wanted])
> head(sp.tab[, wanted])
  Proposta         Processo                                             Endereço        Distrito
1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO VAN CLEVE     VILA ANDRADE
2        2 2003-0.129.667-3 AV. DR. JOSÉ HIGINO, 200 E 216                            AGUA RASA
3        3 2003-0.065.011-2 R. ALIANÇA LIBERAL, 980 E 990                       VILA LEOPOLDINA
4        4 2003-0.165.806-0 R. ALIANÇA LIBERAL, 880 E 886                       VILA LEOPOLDINA
5        5 2003-0.139.053-0 R. DR. JOSÉ DE ANDRADE FIGUEIRA, 111                   VILA ANDRADE
6        6 2003-0.200.692-0 R. JOSÉ DE JESUS, 66                                     VILA SONIA
  Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
1              0,00            1.551,14               127.875,98
2              0,00            3.552,13               267.075,77
3              0,00              624,99                70.212,93
4              0,00              395,64                44.447,18
5              0,00              719,68                41.764,46
6              0,00              446,52                85.152,92

thanks,
--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University      Voice: 416 736-2100 x66249  Fax: 416 736-5814
4700 Keele Street    Web: http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.