PDF Parsing with Text and Coordinates

♀尐吖头ヾ submitted on 2019-12-03 00:02:49
Mark Storer

After poking around the (hard to find) PDFBox docs, I found this little gem.

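Apparently one of the examples shows exactly how to do everything you asked. Basically, you subclass PDFTextStripper and override the processTextPosition method. That method is called with a TextPosition for each piece of text, and you query the TextPosition for whatever information you need (the text itself, its position on the page, its font and size).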

For future reference, you can find the Javadoc here: http://pdfbox.apache.org/apidocs/index.html

Edit 2018-04-02: the original link is dead, but the example can be found in the SVN repo here.

One of the best tools for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is the company of Thomas Merz, the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything you want, including providing positional information about each text element on the page. Oh, and it can also extract images. It recombines and merges images that are fragmented into pieces.

PDFlib.com also offers another incarnation of this technology, the TET plugin for Acrobat. Obviously you'd need Acrobat as well to make use of this.

And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user workstations. Both of these are free (as in beer) to use for private, non-commercial purposes.

Lastly, TET also comes with a command-line interface.

TET is really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage.

A few months ago I tested their desktop standalone tool, and what they say on their webpage is true. It has a very good command-line interface. The tool handled some of my "problematic" PDF test files to my full satisfaction.

This is my recommendation for any sophisticated and challenging PDF text extraction requirement.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

The GetPageText function with extract option 3 or 4 in Quick PDF Library returns a CSV string for the selected page which includes the text (either individual words or a piece of text) and the related font name, text color, text size and co-ordinates on the page.

Note: it is a commercial library and I work for the company that sells it.
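
To give an idea of what you can do with that kind of CSV output once you have it, here is a minimal Python sketch that parses such a string into records and computes a bounding box for each one. It does not call Quick PDF Library itself. The 12-column layout (font name, color, size, four corner points, text), the color format, and the names TextItem, parse_page_text_csv, bounding_box and SAMPLE are all assumptions made up for illustration, so check the column order against the documentation of your library version before relying on it.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextItem:
    font: str
    color: str
    size: float
    corners: List[Tuple[float, float]]  # four (x, y) corner points
    text: str


def parse_page_text_csv(csv_text: str) -> List[TextItem]:
    """Parse rows shaped like: font, color, size, x1, y1, ..., x4, y4, text.

    ASSUMPTION: this 12-column layout is illustrative only; verify it
    against the documentation of the library that produced the CSV.
    """
    items = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 12:
            continue  # skip blank or unexpected rows rather than guessing
        coords = [float(v) for v in row[3:11]]
        corners = list(zip(coords[0::2], coords[1::2]))
        items.append(TextItem(row[0], row[1], float(row[2]), corners, row[11]))
    return items


def bounding_box(item: TextItem) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) over the four corner points."""
    xs = [x for x, _ in item.corners]
    ys = [y for _, y in item.corners]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical sample data, not real library output.
SAMPLE = (
    '"Helvetica","#000000",12,72,700,110.5,700,110.5,712,72,712,"Hello"\n'
    '"Helvetica-Bold","#FF0000",10,120,700,160,700,160,710,120,710,"World, again"\n'
)

if __name__ == "__main__":
    for item in parse_page_text_csv(SAMPLE):
        print(item.text, item.font, item.size, bounding_box(item))
```

Using the csv module rather than splitting on commas matters here, because the extracted text itself can contain commas and quotes.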

Eric Kim

PDF files can be parsed with tabula-py, or tabula-java.

I made a full tutorial on how to use tabula-py in this article. You can also use tabula in a web browser, as long as you have Java installed.
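
Since the question is about text together with coordinates, here is a rough sketch of how that might look with tabula-py. It asks read_pdf for JSON output instead of DataFrames and flattens the result into (text, left, top, width, height) tuples. The helper names iter_cells and extract_cells are made up for this example, and the JSON shape (a list of tables, each with a "data" list of rows of cell dicts carrying "text", "left", "top", "width" and "height") is an assumption about what tabula-java emits, so inspect the actual output for your version.

```python
from typing import Any, Dict, Iterator, List, Tuple

Cell = Tuple[str, float, float, float, float]  # text, left, top, width, height


def iter_cells(tables: List[Dict[str, Any]]) -> Iterator[Cell]:
    """Flatten JSON-style table output into (text, left, top, width, height).

    ASSUMPTION: each table is a dict with a "data" list of rows, and each
    cell is a dict with "text", "left", "top", "width" and "height" keys.
    """
    for table in tables:
        for row in table.get("data", []):
            for cell in row:
                text = cell.get("text", "")
                if text:  # skip empty cells
                    yield (text, cell["left"], cell["top"],
                           cell["width"], cell["height"])


def extract_cells(pdf_path: str, pages: str = "all") -> List[Cell]:
    """Run tabula on a PDF and return its table cells with positions."""
    # Imported lazily so iter_cells stays usable without tabula-py
    # (and the Java runtime it needs) being installed.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages=pages, output_format="json")
    return list(iter_cells(tables))
```

Calling extract_cells("report.pdf") (the file name is just a placeholder) would then give you one tuple per non-empty table cell. Keep in mind that tabula is built for tables, so text outside of detected tables will not show up this way.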
