What is the best way to extract text from a PDF?
Phssthpok
The CAM::PDF module is pretty useful for extracting text and maintaining some information about where it came from in the document. It installs /usr/local/bin/getpdftext.pl which demonstrates simple extraction. However, CAM::PDF can only read PDFs that are completely valid.
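For illustration, a minimal sketch of page-by-page extraction with CAM::PDF might look like the following. It uses the module's `new`, `numPages`, and `getPageText` methods; the helper name `pdf_page_texts` and the output format are assumptions made for this example, not part of the module or of getpdftext.pl.

```perl
use strict;
use warnings;
use CAM::PDF;

# Hypothetical helper: return the text of every page, one element per page.
sub pdf_page_texts {
    my ($path) = @_;

    # new() returns undef when the file cannot be parsed as a valid PDF.
    my $pdf = CAM::PDF->new($path)
        or die "Cannot parse $path: $CAM::PDF::errstr\n";

    # Pages are numbered from 1; scalar() keeps one entry per page
    # even if a page yields no text.
    return map { scalar $pdf->getPageText($_) } 1 .. $pdf->numPages();
}

# When run as a script, print each page preceded by its page number.
if (@ARGV) {
    my $page = 0;
    for my $text (pdf_page_texts($ARGV[0])) {
        $page++;
        print "--- page $page ---\n", (defined $text ? $text : ''), "\n";
    }
}
```

Keeping the text per page is what preserves the "where it came from" information: the list index plus one is the page number.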
If you are dealing with ill-formed PDFs, you may need a more lenient parser, such as the pdftotext command-line tool. Running it on foo.pdf writes the text to foo.txt, which you can then read into Perl.
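As a rough sketch of that approach, the snippet below shells out to pdftotext and slurps the resulting text file. It assumes pdftotext is on your PATH and that the input name ends in .pdf; the helper name `pdftotext_slurp` is made up for this example, and the `-enc UTF-8` option is passed so the output encoding is known when the file is read back.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a PDF with pdftotext and return its text.
sub pdftotext_slurp {
    my ($pdf_path) = @_;

    # Derive foo.txt from foo.pdf; refuse names without a .pdf suffix.
    (my $txt_path = $pdf_path) =~ s/\.pdf\z/.txt/i
        or die "Expected a .pdf file name, got $pdf_path\n";

    # List-form system() bypasses the shell, so odd file names are safe.
    system('pdftotext', '-enc', 'UTF-8', $pdf_path, $txt_path) == 0
        or die "pdftotext failed on $pdf_path (status $?)\n";

    open my $fh, '<:encoding(UTF-8)', $txt_path
        or die "Cannot open $txt_path: $!\n";
    local $/;            # slurp mode: read the whole file in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}

# When run as a script, print the extracted text as UTF-8.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print pdftotext_slurp($ARGV[0]);
}
```

If you do not want the intermediate file, pdftotext also accepts `-` as the output name to write to standard output, which you could capture with a piped `open` instead.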
Source: https://stackoverflow.com/questions/4730651/what-is-the-best-perl-module-to-extract-text-from-a-pdf