Determine whether a PDF page contains text or is purely an image

喜欢而已 submitted on 2019-12-05 04:09:22

There is no watertight way to do what you want.

Text can appear in different ways inside a PDF file. For instance: one can draw all the glyphs as vector paths using graphics operators instead of using text operators, in which case a parser sees shapes, not text. (I'm sorry if this sounds like Chinese to you, but I can assure you it's proper PDF language.)

If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.

In your code, you loop over all the pages, and you ask iText if there's any text on the page. That's already a good indication.

Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. This is an example of a custom implementation: MyTextRenderListener. This custom implementation is used in the ParsingHelloWorld example.

There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you're 100% sure that there's also an image on the page, and you can use the ImageRenderInfo object to obtain the position, width and height of the image (that is: if you know how to interpret the Matrix returned by the getImageCTM() method).
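To make the matrix interpretation concrete, here is a minimal sketch in plain Java. It assumes the six values a, b, c, d, e, f of the transformation matrix have already been read out; in iText 5 they should be available through Matrix.get() with the constants Matrix.I11, I12, I21, I22, I31 and I32. The ImagePlacement class and its fromCtm() method are hypothetical names made up for this illustration, not part of iText.

```java
/** Hypothetical helper (not an iText class): where an image lands on the page. */
final class ImagePlacement {
    final double x;      // x of the transformed origin, in user space units
    final double y;      // y of the transformed origin, in user space units
    final double width;  // rendered width, in user space units
    final double height; // rendered height, in user space units

    private ImagePlacement(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /**
     * A PDF image is painted into the unit square and mapped onto the page by
     * the matrix [a b 0; c d 0; e f 1]. The point (0, 0) maps to (e, f), the
     * edge (1, 0) maps to the vector (a, b), and the edge (0, 1) to (c, d).
     */
    static ImagePlacement fromCtm(double a, double b, double c, double d,
                                  double e, double f) {
        double width = Math.hypot(a, b);   // length of the transformed horizontal edge
        double height = Math.hypot(c, d);  // length of the transformed vertical edge
        // For a rotated or mirrored image, (e, f) is the transformed origin,
        // which is not necessarily the lower-left corner of the bounding box.
        return new ImagePlacement(e, f, width, height);
    }
}
```

For an unrotated image, b and c are zero, so the width is simply |a| and the height |d|. An image whose width and height roughly match the page size is a strong hint that the page is a scan.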

Using all these elements, you can already go a long way toward achieving what you need, but be aware that there will always be exotic PDFs that will escape all your checks.
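One way to combine the two signals is sketched below in plain Java. The PageContentSummary class is a hypothetical helper, not something iText provides: the idea is that a RenderListener would call onText() from renderText() and onImage() from renderImage(), and that you would create one summary per page. It is only a heuristic under those assumptions.

```java
/** Hypothetical per-page tally of what the parser reported (not an iText class). */
final class PageContentSummary {
    enum Kind { HAS_TEXT, IMAGE_ONLY, UNKNOWN }

    private int textChunks; // non-blank text chunks seen on this page
    private int images;     // images seen on this page

    /** Intended to be called from renderText() with the extracted string. */
    void onText(String text) {
        // Whitespace-only chunks are ignored so they do not count as text.
        if (text != null && !text.trim().isEmpty()) {
            textChunks++;
        }
    }

    /** Intended to be called from renderImage(). */
    void onImage() {
        images++;
    }

    Kind classify() {
        if (textChunks > 0) {
            return Kind.HAS_TEXT;
        }
        if (images > 0) {
            return Kind.IMAGE_ONLY;
        }
        // Neither text nor images were reported: the page may be empty, or its
        // glyphs may have been drawn as vector paths, which this check cannot see.
        return Kind.UNKNOWN;
    }
}
```

Keeping UNKNOWN as a separate outcome, rather than folding it into IMAGE_ONLY, leaves room to inspect the exotic cases by hand.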
