Extract lines from pdf files
Hi Thomas,
As Jeff wrote, your HTML email is difficult to read. This is a "plain
text" forum.
As for "pointers", here is one suggestion.
Since you write that you can do the necessary actions with a specific
file, try to write a function that carries out those actions for that
same file.
The only difference is that, inside the function, you replace any specific
data with the value of an argument passed into the function.
e.g.
txt <- pdf_text("10619.pdf")
would be replaced by
txt <- pdf_text(pdfFile)
and your function would have pdfFile as an argument, as in
myfunc <- function( pdfFile )
Since you can accomplish the task for this file without a function,
you should be able to accomplish the task with a function.
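A minimal sketch of what such a function might look like, reusing the steps from the original post. The temporary file, the returned one-row data frame and its column names are assumptions added for illustration. The skip values and column positions are copied from the original code, so they only apply if every PDF has the same layout as 10619.pdf.

```r
# Hypothetical wrapper around the steps in the original post.
# pdfFile: path to a single PDF file.
myfunc <- function(pdfFile) {
  # Extract the text; pdf_text() returns one string per page
  txt <- pdftools::pdf_text(pdfFile)

  # Write the text to a temporary file instead of a fixed "mydata.txt"
  tmp <- tempfile(fileext = ".txt")
  write.table(txt, file = tmp)

  # Open the file once so that successive read.table() calls
  # continue from the current position, as in the original code
  con <- file(tmp, open = "r")
  on.exit({
    close(con)
    unlink(tmp)
  })

  # Offsets copied from the original post (assumed valid for every file)
  serial    <- read.table(con, skip = 5,  nrows = 1)
  flatness  <- read.table(con, skip = 11, nrows = 1)
  parallel1 <- read.table(con, skip = 2,  nrows = 1)
  parallel2 <- read.table(con, skip = 4,  nrows = 1)

  # Return the values of interest as a one-row data frame
  data.frame(
    file      = basename(pdfFile),
    serial    = serial[[3]],
    flatness  = flatness[[5]],
    parallel1 = parallel1[[5]],
    parallel2 = parallel2[[5]]
  )
}
```

Calling myfunc("10619.pdf") should then give the same four values as the original script, provided the offsets are right for that file.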
Once you succeed in doing that, you can try passing the function
arguments that refer to the other files you need to process.
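One way to do that last step, as a sketch only: list the PDF files in a directory, call the per-file function on each one, and stack the results. The helper name process_dir and its arguments are hypothetical, and it assumes the per-file function returns a one-row data frame with the same columns for every file.

```r
# Hypothetical helper: apply a per-file function to every PDF in a
# directory and combine the one-row results into a single table.
# dir: directory to scan; fun: function taking one file path.
process_dir <- function(dir, fun) {
  # Full paths of all files ending in .pdf (case-insensitive)
  files <- list.files(dir, pattern = "\\.pdf$",
                      full.names = TRUE, ignore.case = TRUE)

  # One result per file
  rows <- lapply(files, fun)

  # Stack the rows; returns NULL if no PDF files were found
  do.call(rbind, rows)
}
```

For example, process_dir("path/to/pdfs", myfunc) would use the function with the pdfFile argument described above, where "path/to/pdfs" is a placeholder. A file whose layout differs from the others may raise an error or yield wrong values, so it is worth checking a few rows of the result by hand.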
HTH,
Eric
On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
Please don't spam the mailing list. Especially with HTML format messages. See the Posting Guide.
PDF is designed to present data graphically. It is literally possible to place every character in the page in random order and still achieve this visual readability while practically making it nearly impossible to read. I have encountered many PDF files with the same text placed on the page multiple times... again scrambling your option to read it digitally.
Tools like "pdftools" can sometimes work when the program that generated the file does so in a simple and extraction-friendly way... but there are no guarantees, and your description suggests that it is likely that you won't be able to accomplish your goal with this file.
On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <r-help at r-project.org> wrote:
Colleagues,
I can extract specific data from lines in a pdf using:
library(pdftools)
pdf_text("10619.pdf")
txt <- pdf_text("10619.pdf")
write.table(txt,file="mydata.txt")
con <- file('mydata.txt')
open(con)
serial <- read.table(con,skip=5,nrow=1) # Extract [3]
flatness <- read.table(con,skip=11,nrow=1) # Extract [5]
parallel1 <- read.table(con,skip=2,nrow=1) # Extract [5]
parallel2 <- read.table(con,skip=4,nrow=1) # Extract [5]
close(con)
# note here that serial has 4 variables
# flatness had 6 variables
# parallel1 has 5 variables
# parallel2 has 5 variables
# this outputs the specific data I need
serial[3]
flatness[5]
parallel1[5] # Note here that the txt format shows 0.0007, not scientific;
# is there a way to format this to display the original data?
parallel2[5] # Note here that the txt format shows 0.0006, not scientific;
# is there a way to format this to display the original data?
I'd like to extend this code to all of the pdf files in a directory and
to generate a table of all the serial, flatness, parallel1 and parallel2
data.
I'm not having a lot of success trying to build the script for this.
Some pointers would be appreciated.
All the best.
Thomas Subia
Statistician / Senior Quality Engineer
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Sent from my phone. Please excuse my brevity.