Dear Lists, What is the appropriate software package for dumping, say, 20 PDFs in a folder, then creating a data visualization with frequency counts of certain words, as well as measuring correlation within each file for certain key relationships or keywords? I am doing text analysis of biases in publications sponsored by enterprise software vendors, and need to come up with a statistical threshold. Regards, Ajay Ohri Websites- http://decisionstats.com
text mining analysis and word visualization of pdfs
4 messages · ajay ohri, Karl Ove Hufthammer, Ashim Kapoor +1 more
Ajay Ohri wrote:
What is the appropriate software package for dumping, say, 20 PDFs in a folder, then creating a data visualization with frequency counts of certain words, as well as measuring correlation within each file for certain key relationships or keywords?
pdftotext + Unix for Poets + R (ggplot2). HTH.
Karl Ove Hufthammer
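For what it's worth, the pipeline suggested above might look roughly like the sketch below. It assumes pdftotext (shipped with poppler-utils or xpdf) is installed and that the PDFs sit in a hypothetical folder named pdfs/; the function name word_freq, the PDF_DIR variable and the .freq output files are illustrative and not from the original post. The tr | sort | uniq -c idiom is the basic word-count recipe that "Unix for Poets" teaches.

```shell
#!/bin/sh
# word_freq: read plain text on stdin and print "count word" lines,
# most frequent word first.
word_freq() {
    tr -cs 'A-Za-z' '\n' |   # turn every run of non-letters into one newline
    tr 'A-Z' 'a-z' |         # fold everything to lower case
    grep -v '^$' |           # drop the empty line a leading non-letter leaves
    sort | uniq -c |         # count identical words
    sort -k1,1nr -k2,2       # highest count first, ties in alphabetical order
}

# Convert each PDF to text and write one frequency table per file.
# PDF_DIR is a hypothetical location; change it to suit.
PDF_DIR=${PDF_DIR:-pdfs}
if command -v pdftotext >/dev/null 2>&1; then
    for f in "$PDF_DIR"/*.pdf; do
        [ -e "$f" ] || continue      # no PDFs found: nothing to do
        # "-" as the output name makes pdftotext write to stdout
        pdftotext "$f" - | word_freq > "${f%.pdf}.freq"
    done
fi
```

Each .freq file is a plain two-column table (count, word), so it can be loaded in R with read.table and plotted with ggplot2, or filtered down to the handful of words of interest with grep first.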
1 day later
----------------------------------------
Date: Wed, 18 May 2011 15:24:49 +0530
From: ashimkapoor at gmail.com
To: karl at huftis.org
CC: r-help at stat.math.ethz.ch
Subject: Re: [R] text mining analysis and word visualization of pdfs
On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer wrote:
Ajay Ohri wrote:
What is the appropriate software package for dumping, say, 20 PDFs in a folder, then creating a data visualization with frequency counts of certain words, as well as measuring correlation within each file for certain key relationships or keywords?
pdftotext + Unix for Poets + R (ggplot2)

What about the tm package? I am a beginner and I don't know much about this, but I recall that it does have the ability to handle PDFs. A few words from the experts would be nice.

I don't know if I'm an expert (I can't even get a browser that echoes keystrokes in a reasonable time with a 4-core CPU on 'dohs), but PDF could mean just about anything in terms of how text is represented. Whatever R packages do, they will not be able to read the mind of the author. Even with pdftotext, there are many options, and even simple things like US IRS instruction forms can be almost impossible to extract in a coherent manner. Many authors couldn't care less about the information as long as the thing looks like the paper copy. If you are stuck with PDF, I'd be looking for more tools first, as you will probably want to know how the files are constructed. I would just reiterate that the best approach for many data analysts would be to contact the data source and explain the problems with improperly authored PDFs, or with other specialized file formats that are only supported by limited proprietary tools or that obfuscate the information of interest.
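To make the point about pdftotext options concrete, here is a minimal sketch that extracts the same file in three modes so the results can be compared by eye or with diff. The -layout and -raw switches are standard pdftotext options; the function name compare_extractions and the output file names are made up for illustration.

```shell
#!/bin/sh
# compare_extractions: extract one PDF three ways and report word counts.
#   default : pdftotext's normal reading-order guess
#   -layout : try to preserve the physical page layout (columns, tables)
#   -raw    : keep text in the order it appears in the content stream
compare_extractions() {
    pdf=$1
    command -v pdftotext >/dev/null 2>&1 ||
        { echo "pdftotext not found" >&2; return 127; }
    [ -r "$pdf" ] || { echo "cannot read: $pdf" >&2; return 1; }

    base=${pdf%.pdf}
    pdftotext         "$pdf" "$base.default.txt" &&
    pdftotext -layout "$pdf" "$base.layout.txt"  &&
    pdftotext -raw    "$pdf" "$base.raw.txt"     &&
    wc -w "$base.default.txt" "$base.layout.txt" "$base.raw.txt"
}
```

Calling compare_extractions on one or two representative files and then reading the three .txt outputs side by side is a quick way to see whether any mode gives coherent text before running word counts on the whole folder.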