Message-ID: <4B4895CE.4010409@free.fr>
Date: 2010-01-09T14:42:22Z
From: Laurent Rhelp
Subject: parsing pdf files
In-Reply-To: <e8ec70e41001090511v39e819fckc0c751330027a9a7@mail.gmail.com>

David Kane wrote:

>I have a pdf file that I would like to parse into R:
>
>http://www.williams.edu/Registrar/geninfo/faculty.pdf
>
>For now, I open the file in Acrobat by hand, then save it "as text"
>and then use readLines(). That works fine but a) I am concerned that
>some information may be lost and b) I may be doing this a lot, so I
>would rather have R grab the information from the pdf file directly.
>
>So: is there something like readPDF() for R?
>
>Thanks,
>
>Dave Kane
>
>PS. If you're curious, here is the sort of work that I want to do with
>this data:
>http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>  
>
Do you know this site?

http://www.accesspdf.com/pdftk/

It may provide a command-line option to convert the PDF file to XML,
which you could then read into R.
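
As a rough sketch of the command-line idea above: R can call an external
converter with system2() and read the result back in. I have not checked
which output formats pdftk supports, so the example below assumes a
different tool, pdftotext (from Xpdf/Poppler), is installed and on the
PATH, and it produces plain text rather than XML. The function name
read_pdf_text is made up for illustration.

```r
# Hypothetical helper: convert a PDF to text with an external command-line
# tool, then read the text into R. Assumes `pdftotext` (Xpdf/Poppler) is
# installed; that tool is an assumption, not something named in this thread.
read_pdf_text <- function(pdf_file, tool = "pdftotext") {
  # Sys.which() returns "" when the executable is not on the PATH
  if (!nzchar(Sys.which(tool))) {
    stop("command-line tool not found on PATH: ", tool)
  }

  txt_file <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_file))  # clean up the temporary file on exit

  # -layout asks pdftotext to preserve the physical page layout
  status <- system2(tool,
                    args = c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) {
    stop(tool, " failed with exit status ", status)
  }

  readLines(txt_file)
}
```

With the PDF saved locally, something like
faculty <- read_pdf_text("faculty.pdf") should give a character vector
with one element per line, much like the readLines() step in the manual
workflow.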