Skip to content
Prev 360814 / 398506 Next

web scraping tables generated in multiple server pages

This is my first attempt to try R web scraping tools, for a project my 
daughter is working on.  It concerns a data base of projects in Sao 
Paulo, Brazil, listed at 
http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx, 
but spread out over 69 pages accessed through a javascript menu at the 
bottom of the page.

Each web page contains 3 HTML tables, of which only the last contains 
the relevant data.  In this, only a subset of columns are of interest.  
I tried using the XML package as illustrated on several tutorial pages, 
as shown below.  I have no idea how to automate this to extract these 
tables from multiple web pages.  Is there some other package better 
suited to this task?  Can someone help me solve this and other issues?

# Goal: read the data tables contained on 69 pages generated by the link 
below, where
# each page is generated by a javascript link in the menu of the bottom 
of the page.
#
# Each "page" contains 3 html tables, with names "Table 1", "Table 2", 
and the only one
# of interest with the data, "grdRelSitGeralProcessos"
#
# From each such table, extract the following columns:
#- Processo
#- Endere?o
#- Distrito
#- Area terreno (m2)
#- Valor contrapartida ($)
#- Area excedente (m2)

# NB: All of the numeric fields use "." as comma-separator and "," as 
the decimal separator,
#   but because of this are read in as character


library(XML)
link <- 
"http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"

saopaulo <- htmlParse(link)
saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
length(saopaulo.tables)

# its the third table on this page we want
sp.tab <- saopaulo.tables[[3]]

# columns wanted
wanted <- c(1, 2, 5, 7, 8, 13, 14)
head(sp.tab[, wanted])

 > head(sp.tab[, wanted])
   Proposta Processo Endere??o        Distrito
1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORN??LIO 
VAN CLEVE    VILA ANDRADE
2        2 2003-0.129.667-3                      AV. DR. JOS?? HIGINO, 
200 E 216       AGUA RASA
3        3 2003-0.065.011-2                       R. ALIAN??A LIBERAL, 
980 E 990 VILA LEOPOLDINA
4        4 2003-0.165.806-0                       R. ALIAN??A LIBERAL, 
880 E 886 VILA LEOPOLDINA
5        5 2003-0.139.053-0                R. DR. JOS?? DE ANDRADE 
FIGUEIRA, 111    VILA ANDRADE
6        6 2003-0.200.692-0                                R. JOS?? DE 
JESUS, 66      VILA SONIA
   ??rea Terreno (m2) ??rea Excedente (m2) Valor Contrapartida (R$)
1               0,00             1.551,14 127.875,98
2               0,00             3.552,13 267.075,77
3               0,00               624,99 70.212,93
4               0,00               395,64 44.447,18
5               0,00               719,68 41.764,46
6               0,00               446,52 85.152,92

thanks,