read data from pdf file

Hi, I'm trying to read data from a PDF file. Is it possible to do it
with R? Thanks,  Marco
Basically, No.

But you may be lucky with "copy&paste" using the mouse, from
the display generated in Acrobat Reader to a text file.

The basic procedure here is

1. Click on the "Text Select Tool" (a button usually marked with a "T");

2. Use the mouse to highlight the block of text you want to copy;

3. Depending on your operating system/graphics display: In Windows
   you have (IIRC) to go to "Edit"->"Copy"; in Unix/Linux with
   X Windows do nothing;

4. "Paste" it into your text file, again as appropriate for your
   operating system.

However, you may not be lucky.

PDF can store its content in strange ways, and what may look on the
screen like contiguous and consecutive text is stored internally
in separate "blocks" (what PDF calls "objects"). And this can apply
even to little bits of text in a paragraph.

When you paste the marked text, it will go in in the order that
PDF finds the blocks in the file. As a result, your text file
may contain bits of text in random order.
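
To illustrate why the pasted order can differ from the order on screen, here is a minimal Python sketch. It assumes each bit of text comes with the page position where it is drawn; the Fragment type, the coordinates and the sample values are all made up for illustration and are not how any particular PDF tool represents things. Sorting by position rather than by storage order recovers the rows.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One piece of text with the (hypothetical) position where it is drawn."""
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y position, then sort each line by x."""
    lines = []  # each entry: [y of the line, list of fragments on that line]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [" ".join(f.text for f in sorted(row, key=lambda f: f.x))
            for _, row in lines]


# Made-up fragments, listed in the order they might sit in the file.
stored = [
    Fragment(200, 100, "Weight"),
    Fragment(50, 120, "A"),
    Fragment(50, 100, "Name"),
    Fragment(200, 120, "12.5"),
]

print(" ".join(f.text for f in stored))  # storage order, as a naive paste gives
print(reading_order(stored))             # rows rebuilt from the positions
```

Copy and paste gives you no positions at all, which is why the scrambled order cannot be repaired afterwards from the pasted text alone.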

This especially applies to things arranged in tables. But it
very much depends on the software that generated the PDF in
the first place.

Since often the data in a PDF file which you may want to copy
in this way will be tabular, you are likely to encounter this
problem!

You can tell this is going to happen when you use the mouse to
highlight the text you intend to copy: starting with the mouse
in, say, the top LH corner, move it slowly towards the lower
RH corner of the block. If the highlighting jumps all over the
screen, and/or outside the area you are trying to highlight,
then this is what's happening.

In that case I have sometimes done it by copying lots of little
blocks, too small to provoke the effect. But this is very tedious.

There are other things one can try, such as printing from the
PDF file to a PostScript file, and then using a program like
ps2ascii (which can deal directly with PDF) or pstotext; but frankly
no such program is likely to make a good job of this, because of
the way PS and PDF work.

Sorry to appear unhelpful! But you may get somewhere.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 21-Oct-05                                       Time: 20:07:17
------------------------------ XFMail ------------------------------
Hmm, if this doesn't work you should have a look at pdftolpe, which is
supposed to convert arbitrary PDF files to some LPE-readable format.
LPE is a lightweight programmer's editor that should be able to save the
converted file in txt format.

I never used this myself, though. In case you are running Windows my
reply might not be of much help, sorry for that!

good luck

Thomas
Hello again,

I have to correct myself: it's pdftoipe, and ipe (I missed before that it
was IPE instead of LPE) is a graphical editor for drawing graphs in PS
and PDF. It can save files in XML but has problems reading PDF files
created by other programs, according to its website:
http://ipe.compgeom.org/.

Thomas
Hi,

After looking around, I finally found xpdf-utils, which might help you to
convert PDF to text.
At least I was able to convert a PDF file to text by typing:

pdftotext name.pdf

at the command line.

Maybe there will be some drawbacks related to the resulting text
format (manual adjustments required), but if there is no other way,
you should give it a shot.
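
As a rough sketch of the manual adjustment step, the Python below runs pdftotext and splits each output line into fields. It assumes pdftotext is installed and on the PATH, and it uses the -layout option (keep the physical layout) with "-" as the output file (write to standard output). The function names and the sample text are made up for illustration, and splitting on runs of two or more spaces is only a heuristic that will fail on some tables.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text.

    Assumes the pdftotext program is installed and on the PATH.
    "-layout" keeps the physical layout; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def split_columns(text):
    """Split each non-blank line into fields on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up sample of what layout-preserving output might look like.
sample = """
Site name      Year    Count
North ridge    2004       17
South slope    2005        9
"""

for row in split_columns(sample):
    print(row)
```

For a real file one would call split_columns(pdf_to_text("name.pdf")) and write the rows out as CSV, which R can then read in the usual way.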

regards

Thomas
In Linux (and possibly other *nixes) you can view the file with xpdf and
simply cut and paste it into another window (I use vi) and it's
converted to ASCII text on the fly.  For large documents you might have
to scroll quite a bit to convert the whole document, but this process
has saved my neck a few times.  It does not work with acroread (the
Linux Acrobat Reader program), however.

Dave Roberts

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Ghostview has at least one method for extracting the text from a PDF
document.  In particular, Text|Extract allows you to select pages for
extraction.  This may or may not give the same result as pdftotext,
because I think Ghostview's extraction is Ghostscript-based, whereas
pdftotext comes from xpdf.

Your mileage may vary when extracting tables from a PDF.
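
One cheap way to see how far the mileage varied is to check that every extracted row has the same number of fields. The Python sketch below assumes the table has already been split into lists of fields; the function name and the sample rows are made up for illustration.

```python
from collections import Counter


def suspicious_rows(rows):
    """Return (index, row) pairs whose field count differs from the most common one."""
    if not rows:
        return []
    expected = Counter(len(r) for r in rows).most_common(1)[0][0]
    return [(i, r) for i, r in enumerate(rows) if len(r) != expected]


# Made-up rows; the third has lost a field during extraction.
rows = [
    ["Site", "Year", "Count"],
    ["North ridge", "2004", "17"],
    ["South slope", "2005"],
]
print(suspicious_rows(rows))
```

Rows flagged this way still have to be fixed by hand against the original PDF.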

cheers