Reading a web page in pdf format

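Here is one additional solution.  This one produces a data frame.  The
regular expression removes:

- everything from the beginning of the line to the first (
- everything from the last ) to the end of the line
- everything between a ) and the next ( in the middle

The | characters separate these three alternatives in the pattern.
read.table then reads the cleaned lines, with dec = "," so that decimal
commas are parsed as numbers.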


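# read the raw PDF source as text lines
URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
# keep only the lines that mention the two rows of interest
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)

# three alternatives: start to first "(", last ")" to end, ")" to next "("
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
# delete the matches and parse what is left, treating "," as the decimal mark
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")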
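
To see what each alternative strips without downloading the PDF, here is
a minimal sketch on a single made-up line.  The line, its operators and
its numbers are assumptions for illustration only, not taken from the
actual file.  It keeps whitespace inside the parentheses on purpose:
since the replacement is the empty string, adjacent fields would
otherwise run together.

```r
# hypothetical line, loosely imitating text operators in a PDF stream
line <- "BT /F1 9 Tf (Industriale ) Tj 40 0 Td ( 45,1 ) Tj 40 0 Td ( 43,7 ) Tj ET"

# same pattern as above
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# only the text that was inside the parentheses survives
cleaned <- gsub(rx, "", line)

# whitespace-separated fields; dec = "," turns 45,1 into 45.1
df <- read.table(text = cleaned, dec = ",", stringsAsFactors = FALSE)
str(df)
```

If the real lines carry no whitespace inside the parentheses, replacing
the matches with " " instead of "" may be a safer choice.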
On 5/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: