text mining analysis and word visualization of pdfs - R-help

Tue, May 17, 2011 9:57 PM

Dear Lists,

What is the appropriate software package for dumping, say, 20 PDFs in a
folder, then creating data visualizations with frequency counts of
certain words, as well as measuring correlation within each file for
certain key relationships or keywords?

I am doing text analysis of biases in publications sponsored by
enterprise software vendors, and need to come up with a statistical threshold.

Regards,

Ajay Ohri

Websites-
http://decisionstats.com

Karl Ove Hufthammer

Wed, May 18, 2011 1:14 AM

Ajay Ohri wrote:

pdftotext + Unix™ for Poets + R (ggplot2)
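
A minimal Python sketch of the same idea (extract the text, then count words) might look like the following. It is only an illustration, not the pipeline suggested above: Python stands in for the shell and R steps, pdftotext is assumed to be installed and on the PATH, and the function names pdf_to_text, word_counts and count_folder are hypothetical.

```python
import re
import subprocess
from collections import Counter
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text by calling the pdftotext command-line tool (assumed installed)."""
    # Passing "-" as the output file asks pdftotext to write to standard output.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def word_counts(text, keywords=None):
    """Count lower-cased alphabetic words; optionally keep only the given keywords."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if keywords is not None:
        # Keep an explicit zero for keywords that never occur in the text.
        return Counter({k: counts[k] for k in keywords})
    return counts


def count_folder(folder, keywords):
    """Return {file name: Counter of keyword frequencies} for each PDF in folder."""
    return {
        pdf.name: word_counts(pdf_to_text(pdf), keywords)
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }
```

For example, count_folder("papers", ["cloud", "vendor"]) (with "papers" as a placeholder folder name) would give one Counter per PDF, which could then be plotted or compared across files.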

HTH.

Karl Ove Hufthammer

Ashim Kapoor

Wed, May 18, 2011 2:54 AM

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110518/fa9dd045/attachment.pl>

Mike Marchywka

Thu, May 19, 2011 4:26 AM

----------------------------------------
Date: Wed, 18 May 2011 15:24:49 +0530
From: ashimkapoor at gmail.com
To: karl at huftis.org
CC: r-help at stat.math.ethz.ch
Subject: Re: [R] text mining analysis and word visualization of pdfs

On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer wrote:

this but I recall that it does have the ability to handle PDFs. A few words
from the experts would be nice.

I don't know if I'm an expert (I can't even get a browser that echoes
keystrokes in a reasonable time with a 4-core CPU on 'dohs), but PDF
could mean just about anything in terms of how text is represented. Whatever
R packages do, they will not be able to read the mind of the author.
Even with pdftotext, there are many options, and even simple things like
US IRS instruction forms can be almost impossible to extract in a coherent
manner. Many authors couldn't care less about the information as long as the
thing looks like a paper copy. If you are stuck with PDF, I'd be looking
for more tools first, as you will probably want to know how the files are constructed.
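
One rough way to see how much the extraction options matter for a given file is to run pdftotext in several modes and compare the output. The Python sketch below assumes pdftotext is installed and on the PATH; -layout and -raw are pdftotext options, while the helper names and the "word-like fraction" score are hypothetical and only a crude heuristic, not an established quality measure.

```python
import re
import subprocess

# Extraction modes to compare. "-layout" and "-raw" are pdftotext options;
# the empty list is the tool's default mode.
MODES = {"default": [], "layout": ["-layout"], "raw": ["-raw"]}

# A token counts as "word-like" if it is two or more letters (optionally
# followed by one punctuation mark) or a plain number. Single-letter words
# such as "a" are deliberately not counted, so scores are only comparable
# between modes, not meaningful in absolute terms.
WORDLIKE = re.compile(r"[A-Za-z]{2,}[.,;:!?]?|\d+(?:[.,]\d+)*")


def extract(pdf_path, extra_args):
    """Run pdftotext with the given extra options and return the text."""
    cmd = ["pdftotext", *extra_args, str(pdf_path), "-"]
    result = subprocess.run(
        cmd, capture_output=True, encoding="utf-8", errors="replace", check=True
    )
    return result.stdout


def wordlike_fraction(text):
    """Share of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if WORDLIKE.fullmatch(t)) / len(tokens)


def compare_modes(pdf_path):
    """Return {mode name: word-like fraction} for one PDF."""
    return {
        name: wordlike_fraction(extract(pdf_path, args))
        for name, args in MODES.items()
    }
```

A file whose score drops sharply in every mode (for instance, text that comes out one letter per token) is probably one where the layout, not the analysis, is the real problem.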

I would just reiterate that the best approach for many data analysts would
be to contact the data source and explain the problems with improperly authored
PDFs or other specialized file formats that are only supported by limited
proprietary tools or that obfuscate the information of interest.

