parsing pdf files

On Sat, Jan 9, 2010 at 1:11 PM, David Kane <dave at kanecap.com> wrote:
What could it do that saving as text from Acrobat couldn't do? Here's
the problem - PDF is a page description format, it's not designed to
be read back. There's no guarantee that the letters on the page appear
in the PDF in the same order as you read them on the page. The page could
have all the letter 'a's, then the 'b's and so on, positioned in their
right places to make up words. To reconstruct the words you'd have to
spot where the letters were being placed, and then figure out the
breaks and make up the words. Good luck making the sentences.
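
To make that concrete, here's a rough sketch in R of what rebuilding one
line of text might look like. Everything in it is made up for
illustration: the glyphs data frame, its char/x/y columns, the
rebuild_line() name and the gap threshold are all assumptions, not
anything a real PDF library hands you. It only handles a single line of
evenly spaced letters; real pages need grouping by y, font metrics and a
lot more.

```r
# Hypothetical glyph records for one line of text, stored the way a
# perverse PDF might do it: all the 'a's first, then 'c', 'h', 't'.
glyphs <- data.frame(
  char = c("a", "a", "c", "h", "t", "t"),
  x    = c(10, 50, 0, 40, 20, 60),      # assumed horizontal positions
  y    = rep(100, 6),                   # same baseline, so one line
  stringsAsFactors = FALSE
)

# Sort the letters left to right, then start a new word wherever the
# jump in x is bigger than `gap` (an assumed, hand-picked threshold).
rebuild_line <- function(glyphs, gap = 15) {
  g <- glyphs[order(glyphs$x), ]
  word_id <- cumsum(c(TRUE, diff(g$x) > gap))
  words <- tapply(g$char, word_id, paste, collapse = "")
  paste(words, collapse = " ")
}

rebuild_line(glyphs)   # should give "cat hat"
```

The threshold is the weak spot: pick it too small and words fall apart,
too large and they run together, which is exactly the guesswork
described above.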

 Most PDFs aren't that perverse, and you can often get sensible text
out of them. But then you run into font encodings and graphics and
column layouts and stuff. Any effort put into writing a readPDF()
would have to be redone every time someone tried to read a PDF :)

 On Linux/Unix there's a bunch of command line tools for trying to do
this kind of thing with PDF files - see pdftotext for example. You
could run that from R with system() and then read the text with
readLines. But there's absolutely no guarantee this will work.
Windows/Mac versions (did you say what your platform was?) of the
command line tools may be available.
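
Something along these lines is what I mean, as a minimal sketch. It
assumes pdftotext is installed and on your PATH; read_pdf_text() is just
a name picked for this example, and "report.pdf" in the usage comment is
a placeholder file name. The -layout option asks pdftotext to try to
keep the physical layout of the page, which sometimes helps with columns
and sometimes doesn't.

```r
# Convert a PDF to text with the external pdftotext tool, then read the
# result into R. Returns a character vector, one element per text line.
read_pdf_text <- function(pdf, layout = FALSE) {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext not found on the PATH")
  }
  txt <- tempfile(fileext = ".txt")
  on.exit(unlink(txt))                  # clean up the temporary file

  args <- c(if (layout) "-layout", shQuote(pdf), shQuote(txt))
  status <- system(paste(c("pdftotext", args), collapse = " "))
  if (status != 0) {
    stop("pdftotext failed with exit status ", status)
  }
  readLines(txt, warn = FALSE)
}

# Usage (placeholder file name):
# lines <- read_pdf_text("report.pdf", layout = TRUE)
# head(lines)
```

What comes back is just lines of text, so you still have to pick the
numbers out of it yourself, and that part will differ for every PDF.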

 The real answer is to get the original data in a format with some
kind of semantics that R could read, for example a CSV or some nice
XML format.

Barry