Reading a web page in pdf format
Here is one additional solution. This one produces a data frame. The regular expression removes:

- everything from the beginning to the first (
- everything from the last ) to the end
- everything between ) and ( in the middle

The | characters separate the three parts. Then read.table reads it in.

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
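To see what the three alternatives of the regular expression remove, it can be tried on a single line. The sketch below uses a made-up line that only imitates how text might appear in the PDF; it is not taken from the real file. It also substitutes a space rather than the empty string, which is an assumption made here so that the label and the number stay separated on this invented input.

```r
# Made-up line imitating PDF text operators; not taken from the real file.
x <- "BT /F1 10 Tf (Industriale) Tj 200 0 Td (45,3) Tj ET"
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# The pieces of the line that the expression matches.
pieces <- regmatches(x, gregexpr(rx, x))[[1]]

# Replace each matched piece with a space (assumption: keeps the fields apart).
cleaned <- gsub(rx, " ", x)

# dec = "," makes read.table read "45,3" as the number 45.3.
tab <- read.table(text = cleaned, dec = ",")
print(pieces)
print(tab)
```

Whether a space or the empty string is the right replacement depends on what the lines in the actual file look like, so inspect Lines before choosing.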
On 5/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
Modify this to suit. After grepping out the correct lines we use strapply
to find and emit character sequences that come after a "(" but do not contain
a ")". The argument back = -1 says to emit only the backreferences and not the
entire matched expression (which would have included the leading "("):
URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
library(gsubfn)
strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)
which gives a character matrix whose first column is the label
and second column is the number in character form. You can
then manipulate it as desired.
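If gsubfn is not available, roughly the same extraction can be sketched with base R's gregexpr and regmatches. This is a swapped-in technique, not the strapply call itself, and the two input lines are invented for illustration. It also assumes every line contains the same number of parenthesised fields, so that rbind yields a regular matrix.

```r
# Made-up lines standing in for the grepped PDF lines (not real data).
Lines <- c("BT (Industriale) Tj 200 0 Td (45,3) Tj ET",
           "BT (Termoelettrico) Tj 200 0 Td (78,9) Tj ET")

# Find every "(...)" group on each line.
hits <- regmatches(Lines, gregexpr("[(][^)]*[)]", Lines))

# Strip the surrounding parentheses from each group.
fields <- lapply(hits, function(h) gsub("^[(]|[)]$", "", h))

# Stack into a character matrix: one row per line, label then number.
m <- do.call(rbind, fields)

# Convert the decimal comma before turning the second column into numbers.
values <- as.numeric(sub(",", ".", m[, 2], fixed = TRUE))
print(m)
print(values)
```

The last step is one way to "manipulate it as desired": the second column arrives as text with a decimal comma, so it needs converting before any arithmetic.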
On 5/9/07, Vittorio <vdemart1 at tin.it> wrote:
Each day the daily balance at the following link is updated:
http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf

I would like to set up an R procedure, to be run daily on a server, that reads the figures in just a couple of lines ("Industriale" and "Termoelettrico", towards the end of the balance) and puts the data in a table. Is that possible? If yes, what R packages should I use?

Ciao
Vittorio
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.