Extracting the first currency value from PDF files
On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
> On May 13, 2020 6:33:03 AM PDT, Manish Mukherjee wrote:
> > How to extract this value from a number of PDF files and put it in a data frame.
>
> they could be part of embedded bitmaps.
Dear Manish and Jeff,
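I recently found the programs pdftoppm [1]
and Google's Tesseract [2] to be really useful
when reading text from PDFs formatted as "a
single column of text of variable sizes"
(Tesseract's description of --psm 4), e.g. a
receipt from a grocery store :)
pdftoppm renders the pages to PNG images, and
Tesseract then runs OCR on them: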
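folder <- "path/to/pdfs"
# full.names=TRUE keeps the folder in each path
pdfs <- list.files(folder, "\\.pdf$",
                   full.names=TRUE)
pdf <- pdfs[1]
# render the pages to /tmp/out-N.png at 500 dpi,
# then OCR page 1 to stdout (-l nor = Norwegian)
cmd <-
  paste0("pdftoppm -png -r 500 ",
         shQuote(pdf), " /tmp/out && ",
         "tesseract /tmp/out-1.png - ",
         "-l nor --psm 4")
lines <- system(cmd, intern=TRUE)
# to do all files, let x hold one such
# command per pdf, then:
# x <- lapply(x, system, intern=TRUE)
# names(x) <- pdfs
# saveRDS(x, "texts.rds")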
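To get from the OCR'ed lines to the data frame
Manish asked for, something like the sketch
below might do. It is only a sketch: the
pattern, the helpers first_currency() and
parse_amount(), and the example lines are my
own assumptions, not something taken from the
actual PDFs. It assumes an amount is written
with exactly two decimals after a point or a
comma, so a bare "13.05" from a date would
also be picked up.

```r
# Assumed pattern: digits, optionally grouped in thousands,
# then a decimal point or comma and exactly two decimals.
amount_pattern <- paste0(
  "\\b(?:\\d{1,3}(?:[ ,.]\\d{3})+|\\d+)",  # integer part
  "[.,]\\d{2}(?![.,]?\\d)"                  # two decimals, no further digits
)

# Hypothetical helper: first amount-like string in a vector of lines
first_currency <- function(lines) {
  m <- regmatches(lines, regexpr(amount_pattern, lines, perl = TRUE))
  if (length(m) > 0) m[1] else NA_character_
}

# Hypothetical helper: "1 234,50" -> 1234.5
# (safe only because the pattern guarantees two decimals)
parse_amount <- function(s) {
  as.numeric(gsub("[^0-9]", "", s)) / 100
}

# Made-up stand-in for the list saved in "texts.rds":
# one character vector of lines per file
texts <- list(
  "a.pdf" = c("REMA 1000", "Melk 1l 19,90", "SUM 1 245,50"),
  "b.pdf" = c("no amounts here")
)

raw <- vapply(texts, first_currency, character(1))
df <- data.frame(file = names(texts),
                 raw = unname(raw),
                 amount = parse_amount(unname(raw)),
                 stringsAsFactors = FALSE)
print(df)
```

Files without a match end up as NA rows, so
they are easy to spot and check by hand.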
In any other case with a sensibly formatted
PDF, I would have used pdftotext [3], which
reads the embedded text directly without
OCR ...
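For that case, a minimal sketch could look
like this. The helper names pdftotext_args()
and read_pdf_text() are made up for
illustration, and it assumes pdftotext is
installed and on the PATH.

```r
# Hypothetical helper: arguments for pdftotext.
# "-layout" keeps the physical layout of the page,
# and "-" as the output file sends the text to stdout.
pdftotext_args <- function(pdf) {
  c("-layout", shQuote(pdf), "-")
}

# Hypothetical helper: text of one PDF as a character
# vector with one element per line
read_pdf_text <- function(pdf) {
  system2("pdftotext", pdftotext_args(pdf), stdout = TRUE)
}
```

With the files from list.files(), something
like lapply(pdfs, read_pdf_text) would then
give one vector of lines per file.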
Best,
Rasmus
[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html