Extracting the first currency value from PDF files

On 2020-05-13 06:44 -0700, Jeff Newmiller wrote:
Dear Manish and Jeff,

I recently found the programs pdftoppm [1] 
and Google's tesseract [2] really useful for 
reading text from pdfs laid out as "a single 
column of text of variable sizes", e.g. a 
receipt from a grocery store :)

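folder <- "path/to/pdfs"
# Escape the dot so only names ending in ".pdf" match
pdfs <- list.files(folder, pattern = "\\.pdf$")
pdf <- pdfs[1]
# Render the pdf to png at 500 dpi, then OCR page 1
# with the Norwegian model; --psm 4 assumes a single
# column of text of variable sizes
cmd <-
  paste0("pdftoppm -png -r 500 ",
         shQuote(file.path(folder, pdf)),
         " /tmp/out && ",
         "tesseract /tmp/out-1.png - ",
         "-l nor --psm 4")
lines <- system(cmd, intern = TRUE)
# With x holding one such command per pdf:
# x <- lapply(x, system, intern = TRUE)
# names(x) <- pdfs
# saveRDS(x, "texts.rds")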
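
Once the OCR output is in a character vector, the first currency value could be picked out with a regular expression. The following is only a sketch: the function name first_amount and the sample receipt lines are made up for illustration, and it assumes amounts are written with a decimal comma or dot followed by exactly two digits. Thousands separators, currency symbols and dates such as 13.05.2020 are not handled and would need a stricter pattern.

```r
# Hypothetical helper: return the first money-like token as a number.
first_amount <- function(lines) {
  # Assumed format: digits, a decimal comma or dot, exactly two digits,
  # and no further digit right after (perl = TRUE enables the lookahead).
  pattern <- "[0-9]+[.,][0-9]{2}(?![0-9])"
  # regmatches() keeps only the elements that matched, in order
  hits <- regmatches(lines, regexpr(pattern, lines, perl = TRUE))
  if (length(hits) == 0) return(NA_real_)
  # Normalise a decimal comma to a dot before converting
  as.numeric(sub(",", ".", hits[1], fixed = TRUE))
}

# Made-up lines standing in for the tesseract output
ocr_lines <- c("BUTIKK 123", "Melk 1l   18,90", "Sum   218,40")
first_amount(ocr_lines)
```

Lines without a match are skipped, so the header line above is ignored and the first priced item is returned; NA is returned when nothing matches at all.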

In any other case with a sensibly formatted 
pdf, I would have used pdftotext [3] ...
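
For a pdf that already carries a text layer, a call along these lines might be enough. It is a sketch, not a tested recipe for any particular file: pdftotext_args and read_pdf_text are hypothetical names, while -layout (keep the physical layout) and "-" (write the text to stdout) are options described in the pdftotext man page [3].

```r
# Hypothetical helper: build the argument vector for pdftotext.
# -layout keeps the physical layout; "-" sends the text to stdout.
pdftotext_args <- function(pdf) {
  c("-layout", shQuote(pdf), "-")
}

# Hypothetical wrapper: returns the text, one element per line.
# Requires pdftotext (poppler-utils) to be on the PATH.
read_pdf_text <- function(pdf) {
  system2("pdftotext", pdftotext_args(pdf), stdout = TRUE)
}
```

It would be called as read_pdf_text(file.path(folder, pdf)), giving a character vector of the same shape as the OCR output.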

Best,
Rasmus

[1] https://manpages.debian.org/buster/poppler-utils/pdftoppm.1.en.html
[2] https://manpages.debian.org/buster/tesseract-ocr/tesseract.1.en.html
[3] https://manpages.debian.org/buster/poppler-utils/pdftotext.1.en.html