Perl PDF line by line Parser?

风格不统一 submitted on 2020-01-15 05:42:08

Question


I have a PDF that consists only of text, with no special characters, images, etc. Is there a Perl module (I have been searching CPAN to no avail) that can help me parse each page line by line? (Converting the PDF to text yields bad results and unparsable data.)

Thanks,


Answer 1:


When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) using the -xml output option. This produces an XML file which I parse using XML::Twig (or any other XML parser you like except XML::Simple).

The XML format is fairly simple. You get a <page> element for each page in the PDF, which contains <fontspec> elements describing the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple can't parse it properly).
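
As a rough sketch of what reading that format could look like with XML::Twig: the sample XML below is hand-written to imitate the structure described above (it is not real pdftohtml output), and extract_fragments is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use XML::Twig;

# Hand-written sample imitating the structure described above
# (illustrative only, not captured from a real PDF).
my $sample_xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="140" left="108" width="200" height="14" font="0">Second line</text>
<text top="100" left="108" width="220" height="14" font="0">First line with <b>bold</b> text</text>
</page>
</pdf2xml>
XML

# Hypothetical helper: return one hash per <text> element, in document order.
sub extract_fragments {
    my ($xml) = @_;
    my @fragments;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once for each completely parsed <page> element
            page => sub {
                my ($t, $page) = @_;
                for my $text ($page->children('text')) {
                    push @fragments, {
                        page => $page->att('number'),
                        top  => $text->att('top'),
                        left => $text->att('left'),
                        # ->text drops the <b>/<i> markup but keeps its content
                        text => $text->text,
                    };
                }
                $t->purge;    # release the page once it has been processed
            },
        },
    );
    $twig->parse($xml);

    return @fragments;
}

printf "page %d top=%d left=%d: %s\n", @{$_}{qw(page top left text)}
    for extract_fragments($sample_xml);
```

With a real file, the XML would presumably come from a command along the lines of `pdftohtml -xml input.pdf output`, and `$twig->parsefile('output.xml')` would replace the `parse` call; the file names here are placeholders.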

You do need to use the top and left attributes of the <text> tags to get them in the right order, because they aren't necessarily emitted in top-to-bottom order. The coordinate system has 0,0 in the upper left corner of the page with down and right being positive. Dimensions are in PostScript points (72 points per inch).
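
One possible way to do that reordering, sketched in plain Perl: order_lines is a hypothetical helper, the fragment hashes (page, top, left, text) are an assumed intermediate format, and the 2-unit tolerance for treating nearby top values as the same line is a guess that may need tuning per document.

```perl
use strict;
use warnings;

# Hypothetical helper: turn unordered text fragments into reading-order lines.
# Each fragment is a hash with page, top, left and text keys.
# $tolerance (an assumption) is how far apart two "top" values may be
# while still counting as the same visual line.
sub order_lines {
    my ($fragments, $tolerance) = @_;
    $tolerance //= 2;

    # Smaller top means higher on the page, smaller left means further left
    my @sorted = sort {
               $a->{page} <=> $b->{page}
            || $a->{top}  <=> $b->{top}
            || $a->{left} <=> $b->{left}
    } @$fragments;

    # Group fragments whose top is close to the top that started the line
    my @groups;
    for my $frag (@sorted) {
        my $last = $groups[-1];
        if (   $last
            && $last->{page} == $frag->{page}
            && abs($last->{top} - $frag->{top}) <= $tolerance)
        {
            push @{ $last->{parts} }, $frag;
        }
        else {
            push @groups,
                { page => $frag->{page}, top => $frag->{top}, parts => [$frag] };
        }
    }

    # Within each line, order left to right and join the pieces
    my @lines;
    for my $group (@groups) {
        my @parts = sort { $a->{left} <=> $b->{left} } @{ $group->{parts} };
        push @lines, join ' ', map { $_->{text} } @parts;
    }
    return @lines;
}

# Fragments deliberately listed out of order
my @fragments = (
    { page => 1, top => 140, left => 108, text => 'Second line' },
    { page => 1, top => 100, left => 300, text => 'world' },
    { page => 1, top => 101, left => 108, text => 'Hello' },
);

print "$_\n" for order_lines(\@fragments);
```

Grouping by a tolerance rather than by exact equality is a design choice: fragments on the same visual line can differ by a unit or so in top when fonts or styles change mid-line.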



Source: https://stackoverflow.com/questions/5021737/perl-pdf-line-by-line-parser
