
Reading table data from PDF files

2 messages · Bryan McCloskey, jim holtman

#
All,

Is anyone familiar with a way to use R to read table data from a large collection of PDF files? I'm aware there are various command-line and desktop utilities that can, for example, dump PDFs to text, which could then be parsed for table data. But I'm hoping there is something more integrated that could be incorporated into R functions and scripts to handle large batches of PDFs in a more automated fashion.

Has anyone used R to extract large amounts of tabular data from PDF documents?

-bryan

------
Bryan McCloskey, Ph.D.
IT Specialist (Data Management/Internet)
U.S. Geological Survey
St. Petersburg Coastal & Marine Science Center
600 Fourth St. South
St. Petersburg, FL 33701

South Florida Information Access: http://sofia.usgs.gov
Everglades Depth Estimation Network: http://sofia.usgs.gov/eden
Phone: 727.803.8747x3017 * Fax: 727.803.2032
------
#
I think a lot would depend on exactly how the data is formatted.  I
have used 'pdf2text' converters (many freely available on the web) to
convert to text and then used R to read in and preprocess the data to
get it into a format I could work with.

You can invoke these converters with the 'system' function and then
read the output file that is generated.  You would then need some
custom code to interpret the data in the text file, depending on how
it was created.

So I am sure you can do it within R, with some auxiliary functions
that are called with 'system', without much trouble.
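
A rough sketch of that approach, for illustration only.  It assumes the
'pdftotext' utility (from Poppler or Xpdf) is installed and on the PATH,
and that table columns in the converted text are separated by two or
more spaces.  The function names and the 'n_cols' argument are made up
for this example, and it uses 'system2', a close relative of 'system'.
The parsing step will almost certainly need adjusting for real files.

```r
# Sketch only: assumes the external 'pdftotext' tool is available.
# All function names below are hypothetical helpers, not from a package.

# Convert one PDF to text and return its lines.
pdf_to_lines <- function(pdf_file) {
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))
  # '-layout' asks pdftotext to keep the physical column layout
  status <- system2("pdftotext",
                    c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext failed for ", pdf_file)
  readLines(txt_file, warn = FALSE)
}

# Keep only lines that split into exactly 'n_cols' fields when broken
# on runs of two or more spaces (an assumption about the layout).
parse_table_lines <- function(lines, n_cols) {
  fields <- strsplit(trimws(lines), " {2,}")
  keep <- lengths(fields) == n_cols
  rows <- do.call(rbind, fields[keep])
  as.data.frame(rows, stringsAsFactors = FALSE)
}

# Apply both steps to every PDF in a directory.
read_pdf_tables <- function(dir, n_cols) {
  pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                     ignore.case = TRUE)
  tables <- lapply(pdfs, function(f) parse_table_lines(pdf_to_lines(f), n_cols))
  names(tables) <- basename(pdfs)
  tables
}
```

Something like read_pdf_tables("reports", n_cols = 3) would then return
one data frame per PDF, with every column still stored as character;
header rows and type conversion are left to the caller.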
On Fri, Feb 3, 2012 at 4:11 PM, Bryan McCloskey <bmccloskey at usgs.gov> wrote: