How can I extract text from a PDF file in Perl?

花落未央 2020-12-03 05:08

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., via Perl's system function) for extra

8 answers
  •  挽巷 (original poster)
    2020-12-03 05:47

    You can extract text from a PDF with these modules:

    PDF::API2

    CAM::PDF

    CAM::PDF::PageText

    From CPAN (the CAM::PDF::PageText documentation):

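       use CAM::PDF;
       use CAM::PDF::PageText;

       my $pdf = CAM::PDF->new($filename);               # parse the PDF file
       my $pageone_tree = $pdf->getPageContentTree(1);   # content tree of page 1
       print CAM::PDF::PageText->render($pageone_tree);  # render the tree as plain text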
    

    This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

    All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
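
    The snippet above only handles the first page. As a rough sketch, the same calls could be looped over every page. The command-line argument handling, the page separator line, and the skip-on-undef check are my own assumptions for illustration, not part of the CPAN synopsis:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;

    # Assumption: the PDF path is passed as the first command-line argument.
    my $filename = shift @ARGV or die "Usage: $0 file.pdf\n";

    # new() returns undef on failure; the reason is left in $CAM::PDF::errstr.
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    # Pages are numbered starting at 1.
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum);
        next unless $tree;    # defensive: skip a page with no usable content tree

        print "--- page $pagenum ---\n";    # illustrative separator only
        print CAM::PDF::PageText->render($tree), "\n";
    }
    ```

    Run it as `perl extract.pl input.pdf > output.txt` (the script name is hypothetical). The same caveats apply: text order is guessed heuristically, so complex layouts may come out scrambled.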
