All,

Is anyone familiar with a way to use R to read table data from a large collection of PDF files? I'm aware there are various command-line and desktop utilities that might be able to, for example, dump PDFs to text, which could then be parsed for table data. But I'm hoping there is something more integrated that could be incorporated into R functions and scripts to handle large batches of PDFs in a more automated fashion.

Has anyone used R to extract large amounts of tabular data from PDF documents?

-bryan

------
Bryan McCloskey, Ph.D.
IT Specialist (Data Management/Internet)
U.S. Geological Survey
St. Petersburg Coastal & Marine Science Center
600 Fourth St. South
St. Petersburg, FL 33701
South Florida Information Access: http://sofia.usgs.gov
Everglades Depth Estimation Network: http://sofia.usgs.gov/eden
Phone: 727.803.8747x3017 * Fax: 727.803.2032
------
Reading table data from PDF files
2 messages · Bryan McCloskey, jim holtman
I think a lot would depend on exactly how the data is formatted. I have used 'pdf2text' converters (many are freely available on the web) to convert the PDFs to text, and then used R to read in and preprocess the data to get it into a usable format. You can invoke these converters with the 'system' function and then read the output file that is generated. You would then need some custom code to interpret the data in the text file, depending on how the PDF was created. So I am sure you can do it within R, with some auxiliary programs called via 'system', without much trouble.
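As a rough illustration of that workflow, here is a minimal sketch. It assumes the converter is the 'pdftotext' utility (from Poppler or Xpdf) and that it is installed and on the PATH; that choice is mine, and any converter that writes a text file would do. The function names and the "a table row starts with a digit" rule are hypothetical placeholders for the custom parsing your PDFs will actually need. It uses 'system2', a close relative of 'system' that returns the exit status.

```r
# Convert one PDF to text with an external converter and return the lines.
# Assumption: 'pdftotext' (Poppler/Xpdf) is installed and on the PATH.
pdf_to_lines <- function(pdf_file) {
  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))
  # '-layout' asks pdftotext to preserve the physical column layout
  status <- system2("pdftotext",
                    c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext failed for ", pdf_file)
  readLines(txt_file, warn = FALSE)
}

# Hypothetical parser: treat every line that starts with a digit as a
# whitespace-separated table row. Real documents will need their own rule.
parse_table_lines <- function(lines, pattern = "^[[:space:]]*[0-9]") {
  rows <- grep(pattern, lines, value = TRUE)
  if (length(rows) == 0) return(NULL)
  read.table(text = rows, header = FALSE, fill = TRUE,
             stringsAsFactors = FALSE)
}

# Apply the two steps to every PDF in a directory; returns a named list
# with one data frame (or NULL) per file.
read_pdf_tables <- function(pdf_dir) {
  pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                          full.names = TRUE, ignore.case = TRUE)
  tables <- lapply(pdf_files, function(f) parse_table_lines(pdf_to_lines(f)))
  names(tables) <- basename(pdf_files)
  tables
}
```

Calling read_pdf_tables("path/to/pdfs") would then give one data frame per file. Keeping the parsing step separate from the conversion step means you can develop and check the parser on a saved text dump without rerunning the converter.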
On Fri, Feb 3, 2012 at 4:11 PM, Bryan McCloskey <bmccloskey at usgs.gov> wrote:
> All,
>
> Is anyone familiar with a way to use R to read table data from a large collection of PDF files? I'm aware there are various command lines and desktop utilities that might be able to (e.g.,) dump PDFs to text, which could then be parsed for table data. But I'm hoping there is something more integrated that could be incorporated into R functions and scripts to handle large batches of PDFs in a more automated fashion.
>
> Has anyone used R to extract large amounts of tabular data from PDF documents?
>
> -bryan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.