The daily balance at http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf is updated each day. I would like to set up an R procedure, run daily on a server, that reads the figures in just two lines ("Industriale" and "Termoelettrico", towards the end of the balance) and puts the data in a table. Is that possible? If so, which R packages should I use? Ciao, Vittorio
Reading a web page in pdf format
8 messages · v.demart@libero.it, Gábor Csárdi, jim holtman +3 more
Vittorio, this isn't really an R problem; you need a tool to extract text from a PDF document. I've tried pdftotext from the xpdf bundle, and it worked fine for the file you linked. On my Ubuntu Linux it is in the xpdf-utils package; search for xpdf to find out whether it is available on Windows, if you use Windows. If you want to call it from R you can use the 'system' function. There may be other, better methods I'm unaware of, of course. Best, Gabor
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Csardi Gabor <csardi at rmki.kfki.hu> MTA RMKI, ELTE TTK
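As a rough sketch of the suggestion above (calling pdftotext from R), something like the following might work. The helper name pdf_to_lines is made up for illustration, pdftotext is assumed to be installed and on the PATH, and the quoting shown is the Unix-style default of shQuote(), so it may need adjusting on Windows.

```r
# Hypothetical helper (not part of any package): convert a local PDF to text
# with the external pdftotext program and return the text as a character vector.
pdf_to_lines <- function(pdf_file, exe = "pdftotext") {
  # Fail early with a clear message if the program cannot be found
  if (!nzchar(Sys.which(exe))) {
    stop("'", exe, "' was not found on the PATH")
  }
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))
  # -layout asks pdftotext to keep the physical layout of the page
  status <- system2(exe, c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt_file, warn = FALSE)
}

# Usage (needs network access and pdftotext, so it is left commented out):
# pdf_url  <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
# pdf_file <- tempfile(fileext = ".pdf")
# download.file(pdf_url, pdf_file, mode = "wb")
# grep("Industriale|Termoelettrico", pdf_to_lines(pdf_file), value = TRUE)
```

Downloading to a temporary file first keeps the binary PDF away from readLines(), which is the main difference from the pure-R replies in this thread.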
You can do it with the base toolkit. Just read the PDF file in as text and then extract the data:
# read in PDF file as text
x.in <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
# find Industriale
Ind <- grep("Industriale", x.in, value=TRUE)
# find Termoelettrico
Ter <- grep("Termoelettrico", x.in, value=TRUE)
# extract the data
Ind.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ind, perl=TRUE)
Ter.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ter, perl=TRUE)
Ind.data
[1] " 46,6"
Ter.data
[1] " 99,3"
Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve?
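For a job that runs unattended every day, it may help to fail loudly when a label is not found, for example if the PDF were ever published with compressed content and readLines() no longer saw plain text. The sketch below wraps the grep and sub steps above in two hypothetical helpers (find_label_line and extract_value are names invented here) and tries them on the two lines quoted elsewhere in this thread rather than on the live URL.

```r
# Sample input: the two relevant lines as they appear in this thread
sample_lines <- c(
  "%PDF-1.2",
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Hypothetical helper: return the single line containing `label`,
# or stop when the label is missing or appears more than once.
find_label_line <- function(lines, label) {
  hits <- grep(label, lines, fixed = TRUE, value = TRUE)
  if (length(hits) != 1L) {
    stop("expected exactly one line containing '", label,
         "', found ", length(hits))
  }
  hits
}

# Same regular expression as above: keep the last parenthesised group
# that consists only of whitespace, digits and commas.
extract_value <- function(line) {
  sub(".*\\(([\\s0-9,]*)\\).*", "\\1", line, perl = TRUE)
}

extract_value(find_label_line(sample_lines, "Industriale"))
extract_value(find_label_line(sample_lines, "Termoelettrico"))
```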
Vittorio,
Keep in mind that PDF files are often largely plain text, and the page content of this one is stored uncompressed. Thus you can read
it in using readLines():
PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
# No unlink() clean-up is needed here: no local file with this name is created
str(PDFFile)
chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...
# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)
Lines
[1] 33 34
PDFFile[Lines]
[1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj"
[2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])
Vals
[1] " 46,6" " 99,3"
# Now convert them to numeric
# need to change the ',' to a '.' at least in my locale
as.numeric(gsub(",", "\\.", Vals))
[1] 46.6 99.3
HTH, Marc Schwartz
Vittorio,
Just a quick tweak here, given the possibility that the order of the
values may be subject to change.
After reading the file and getting the lines, use:
# Use sub() with 2 back references, 1 for each value in the line
Vals <- sub(".*\\((.*)\\).*\\((.*)\\).*", "\\1 \\2", PDFFile[Lines])
Vals
[1] "Industriale 46,6" "Termoelettrico 99,3"
This gives us the labels and the values. Now convert to a data frame and then coerce the values to numeric:
DF <- read.table(textConnection(Vals))
DF
              V1   V2
1    Industriale 46,6
2 Termoelettrico 99,3
DF$V2 <- as.numeric(sub(",", "\\.", DF$V2))
DF
              V1   V2
1    Industriale 46.6
2 Termoelettrico 99.3
str(DF)
'data.frame': 2 obs. of 2 variables:
 $ V1: Factor w/ 2 levels "Industriale",..: 1 2
 $ V2: num 46.6 99.3
HTH, Marc
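Since the original question was about running this daily and keeping the figures in a table, one possible way to accumulate the data frame above is to append it to a CSV file with a date column. This is only a sketch: append_daily, the column names and the log file location are all assumptions made here, not something proposed in the thread.

```r
# Hypothetical helper: append the day's labels and values to a running CSV log.
# DF is expected to have the columns V1 (label) and V2 (numeric value).
append_daily <- function(DF, log_file, day = Sys.Date()) {
  row <- data.frame(date  = format(day),
                    label = as.character(DF$V1),
                    value = DF$V2,
                    stringsAsFactors = FALSE)
  # Write the header only when the file is being created
  new_file <- !file.exists(log_file)
  write.table(row, log_file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
  invisible(row)
}

# Example with the values shown in this thread and two made-up run dates
log_file <- tempfile(fileext = ".csv")
DF <- data.frame(V1 = c("Industriale", "Termoelettrico"), V2 = c(46.6, 99.3))
append_daily(DF, log_file, as.Date("2007-05-09"))
append_daily(DF, log_file, as.Date("2007-05-10"))
read.csv(log_file)
```

On a server, the script could then be scheduled with whatever the system provides (cron, for instance), with log_file pointing at a permanent path instead of a temporary file.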
Modify this to suit. After grepping out the correct lines we use strapply
to find and emit character sequences that come after a "(" but do not contain
a ")" . back = -1 says to only emit the backreferences and not the entire
matched expression (which would have included the leading "(" ):
URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
library(gsubfn)
strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)
which gives a character matrix whose first column is the label
and second column is the number in character form. You can
then manipulate it as desired.
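If installing gsubfn is not an option, a similar character matrix can probably be built with base R alone. The sketch below swaps in regmatches() and gregexpr() with a Perl lookbehind in place of strapply, and additionally trims whitespace, which the strapply call above does not do. It runs on the two lines quoted in this thread instead of the live URL.

```r
# Sample input: the two grepped lines as they appear in this thread
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Match every run of non-")" characters that is preceded by "(".
# The lookbehind (?<=\\() keeps the "(" itself out of the match.
matches <- regmatches(Lines, gregexpr("(?<=\\()[^)]*", Lines, perl = TRUE))

# One row per line: label in column 1, number (still character) in column 2
res <- do.call(rbind, lapply(matches, trimws))
res
```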
Here is one additional solution. This one produces a data frame. The regular expression removes:
- everything from the beginning to the first (
- everything from the last ) to the end
- everything between ) and ( in the middle
The | characters separate the three parts. Then read.table reads it in.
URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
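To see what the three alternatives remove, the pattern can be tried offline on the two lines quoted earlier in the thread. This is a small worked check rather than a replacement for the code above; it uses read.table(text = ...) instead of textConnection(), which is an assumption about the R version in use.

```r
# Sample input: the two grepped lines as they appear in this thread
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Same pattern as above: leading part | trailing part | middle part
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# After the substitution only the text inside the two pairs of
# parentheses is left, separated by whitespace
stripped <- gsub(rx, "", Lines)
stripped

# dec = "," makes read.table treat 46,6 as the number 46.6
DF <- read.table(text = stripped, dec = ",")
DF
```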
Great! It's a wonderful mailing list full of helpful people! Thanks to all of you.
Vittorio