Reading a web page in pdf format

8 messages · v.demart@libero.it, Gábor Csárdi, jim holtman +3 more

#
Each day the daily balance at the following link

http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf

is updated.

I would like to set up an R procedure, run daily on a server, that
reads the figures in just two lines ("Industriale" and
"Termoelettrico", towards the end of the balance) and puts the data
in a table.

Is that possible? If so, which R packages should I use?

Ciao
Vittorio
#
Vittorio,

this isn't really an R problem; you need a tool to extract text from a
PDF document. I've tried pdftotext from the xpdf bundle, and it worked
fine for the file you linked. On my Ubuntu Linux it is in the
xpdf-utils package; if you use Windows, search for xpdf to find out
whether it is available there.

If you want to call it from R you can use the 'system' function.

There may be other, better methods I'm unaware of, of course.

Best,
Gabor
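
A minimal sketch of the pdftotext route described above, not code from the thread. It assumes pdftotext is installed and on the PATH; the helper names pdf_to_lines and pick_rows are hypothetical, and the layout of the converted text will depend on the PDF.

```r
# Hypothetical helper: convert a local PDF to text lines with pdftotext.
# Assumes the pdftotext binary (xpdf or poppler) is on the PATH.
pdf_to_lines <- function(pdf_path) {
  if (!nzchar(Sys.which("pdftotext"))) stop("pdftotext not found on PATH")
  # "-layout" keeps the column layout; "-" sends the text to stdout
  system2("pdftotext", c("-layout", shQuote(pdf_path), "-"), stdout = TRUE)
}

# Hypothetical helper: keep only the lines for the two rows of interest.
pick_rows <- function(lines) {
  grep("Industriale|Termoelettrico", lines, value = TRUE)
}

# Usage (needs network access and pdftotext, so shown as comments):
# tmp <- tempfile(fileext = ".pdf")
# download.file(URL, tmp, mode = "wb")   # URL as given in the question
# pick_rows(pdf_to_lines(tmp))
```

system2() is used here instead of system() only because it takes the arguments as a vector; either function would do.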
#
You can do it with the base toolkit.  Just read the PDF file in as
text and then extract the data:
[1] "       46,6"
[1] "       99,3"
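
The code for this message did not survive in the archive. As a rough sketch of what a base-R approach could look like (an assumption, not the poster's original code), the two sample lines quoted elsewhere in this thread stand in for the result of readLines() on the PDF:

```r
# Two sample lines copied from the thread, standing in for readLines(URL)
pdf_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# Keep the lines that mention either label
hits <- grep("Industriale|Termoelettrico", pdf_lines, value = TRUE)

# Pull the first "digits,digits" token out of each matching line
raw_vals <- regmatches(hits, regexpr("[0-9]+,[0-9]+", hits))

# Swap the decimal comma for a point and convert to numeric
vals <- as.numeric(sub(",", ".", raw_vals, fixed = TRUE))
```

Unlike the output shown above, this variant drops the leading spaces because it matches only the number itself.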

        
#
On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
Vittorio,

Keep in mind that PDF files are often plain text internally, as this one
is (a PDF whose content streams are compressed will not work this way).
Thus you can read it in using readLines():

PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")

# readLines() opens and closes the URL connection itself,
# so no separate cleanup step is needed
> str(PDFFile)
 chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...


# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)

> Lines
[1] 33 34

> PDFFile[Lines]
[1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj"
[2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"


# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])

> Vals
[1] "       46,6" "       99,3"


# Now convert them to numeric
# need to change the ',' to a '.' at least in my locale
> as.numeric(sub(",", "\\.", Vals))
[1] 46.6 99.3


HTH,

Marc Schwartz
#
On Wed, 2007-05-09 at 10:55 -0500, Marc Schwartz wrote:
Vittorio,

Just a quick tweak here, given the possibility that the order of the
values may be subject to change.

After reading the file and getting the lines, use:

# Use sub() with 2 back references, 1 for each value in the line
Vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", PDFFile[Lines])

> Vals
[1] "Industriale         46,6"    "Termoelettrico         99,3"


This gives us the labels and the values. Now convert to a data frame and
then coerce the values to numeric:

DF <- read.table(textConnection(Vals))

> DF
              V1   V2
1    Industriale 46,6
2 Termoelettrico 99,3


DF$V2 <- as.numeric(sub(",", "\\.", DF$V2))

> DF
              V1   V2
1    Industriale 46.6
2 Termoelettrico 99.3

> str(DF)
'data.frame':   2 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "Industriale",..: 1 2
 $ V2: num  46.6 99.3


HTH,

Marc
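
Since the original question asks for a daily run that puts the data in a table, here is one possible sketch of that last step, not something proposed in the thread. It assumes a data frame with columns V1 (label) and V2 (numeric value) like the one built above; the helper name append_daily and the CSV layout are hypothetical.

```r
# Hypothetical helper: append one day's figures to a CSV log, one row per label.
# The header is written only when the file is first created.
append_daily <- function(df, csv_path, date = Sys.Date()) {
  out <- data.frame(date = format(date), label = df$V1, value = df$V2)
  is_new <- !file.exists(csv_path)
  write.table(out, csv_path, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new)
  invisible(out)
}

# Example with the figures from the thread, written to a temporary file
DF <- data.frame(V1 = c("Industriale", "Termoelettrico"), V2 = c(46.6, 99.3))
log_file <- tempfile(fileext = ".csv")
append_daily(DF, log_file, as.Date("2007-05-09"))
append_daily(DF, log_file, as.Date("2007-05-10"))
daily_log <- read.csv(log_file)
```

On a server, a script ending in a call like append_daily(DF, "balance.csv") could be scheduled with cron or a similar tool; the file name here is only a placeholder.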
#
Modify this to suit.  After grepping out the correct lines we use strapply
to find and emit character sequences that come after a "(" but do not contain
a ")".  back = -1 says to emit only the backreferences and not the entire
matched expression (which would have included the leading "("):

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
library(gsubfn)
strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)

which gives a character matrix whose first column is the label
and second column is the number in character form.  You can
then manipulate it as desired.
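
For readers without gsubfn, a base-R sketch that aims at a similar label/number matrix is shown here. It is an alternative technique (a Perl lookbehind with gregexpr and regmatches), not the strapply call itself, and it uses the two sample lines quoted earlier in the thread in place of the downloaded file.

```r
# Two sample lines copied from the thread, standing in for the grep() result
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# Match runs of non-")" characters that directly follow a "("
m <- regmatches(Lines, gregexpr("(?<=[(])[^)]*", Lines, perl = TRUE))

# One row per line: column 1 is the label, column 2 the number as text
mat <- do.call(rbind, m)

# Trim padding and convert the decimal comma
labels <- trimws(mat[, 1])
values <- as.numeric(sub(",", ".", trimws(mat[, 2]), fixed = TRUE))
```

This relies on every matching line containing exactly two parenthesised strings; lines with a different count would make rbind recycle values.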
#
Here is one additional solution.  This one produces a data frame.  The
regular expression removes:

- everything from beginning to first (
- everything from last ( to end
- everything between ) and ( in the middle

The | characters separate the three parts.  Then read.table reads it in.


URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)

rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
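
To see what the three-part regular expression does without fetching the PDF, the same steps can be tried on the two sample lines quoted earlier in the thread. This is only an offline illustration; the intermediate name stripped is introduced here for clarity.

```r
# Two sample lines copied from the thread, standing in for the grep() result
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm (       46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm (       99,3)Tj"
)

# Same pattern as above: leading part | trailing part | middle part
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# After removal only the label and the number remain, separated by spaces
stripped <- gsub(rx, "", Lines)

# dec = "," makes read.table parse the decimal comma directly
DF <- read.table(text = stripped, dec = ",")
```

The resulting data frame has the labels in V1 and the numeric values in V2, so no separate as.numeric() step is needed.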
On 5/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
#
Great! It's a wonderful mailing list full of helpful people!
Thanks to all of you.

Vittorio

On Wednesday 09 May 2007 18:57:39 Gabor Grothendieck wrote: