parsing pdf files
[copied to list for posterity...]

Sorry, I was completely wrong. I've been using iText to split PDFs, fill in forms and recombine them, so I assumed (wrongly) that text extraction was possible too. In fact, reading the mailing lists is quite informative: clearly PDF is not designed for this.

Try this instead: http://pdfbox.apache.org/commandlineutilities/ExtractText
It can be run from the command line, so it could potentially be automated.

Mark

2010/1/10 Mark Wardle <mark at wardle.org>:
If you can use an R <-> Java interface, you could use iText to do this, as long as the PDF is fairly sane. See http://itextpdf.com/ It is what pdftk uses.

b/w

Mark

2010/1/9 David Kane <dave at kanecap.com>:
I have a PDF file that I would like to parse into R: http://www.williams.edu/Registrar/geninfo/faculty.pdf

For now, I open the file in Acrobat by hand, then save it "as text" and then use readLines(). That works fine, but a) I am concerned that some information may be lost and b) I may be doing this a lot, so I would rather have R grab the information from the PDF file directly.

So: is there something like readPDF() for R?

Thanks,

Dave Kane

PS. If you're curious, here is the sort of work that I want to do with this data: http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
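For what it's worth, the PDFBox ExtractText suggestion made in this thread could be automated from R roughly as follows. This is only a sketch under assumptions: Java is on the PATH, a PDFBox "app" jar that provides the ExtractText command-line tool has been downloaded, and the helper names pdfbox_args() and read_pdf_text() and the jar file name are made up for illustration. Newer PDFBox releases may spell the command differently.

```r
# Sketch only: assumes Java is on the PATH and that `jar` points at a
# PDFBox "app" jar whose command-line tools include ExtractText.

# Hypothetical helper: build the arguments for
#   java -jar <jar> ExtractText <pdf> <txt>
pdfbox_args <- function(jar, pdf, txt) {
  c("-jar", jar, "ExtractText", pdf, txt)
}

# Hypothetical helper: run ExtractText on `pdf`, then read the text into R.
read_pdf_text <- function(pdf, jar, txt = tempfile(fileext = ".txt")) {
  # shQuote() protects paths that contain spaces
  status <- system2("java", args = shQuote(pdfbox_args(jar, pdf, txt)))
  if (status != 0) {
    stop("ExtractText failed with exit status ", status)
  }
  readLines(txt, warn = FALSE)
}

# Usage (not run here; the jar file name is a placeholder):
# download.file("http://www.williams.edu/Registrar/geninfo/faculty.pdf",
#               "faculty.pdf", mode = "wb")
# lines <- read_pdf_text("faculty.pdf", jar = "pdfbox-app.jar")
```

The result is a character vector just like the one readLines() returns for the hand-saved text file, so any existing parsing code should carry over. Whether the extracted text matches what Acrobat's "save as text" produces would need checking against the actual file.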
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Dr. Mark Wardle Specialist registrar, Neurology Cardiff, UK