Why is the text extracted from a PDF by Java text extractors such as PDFBox or iText scattered and unstructured?

Submitted by 谁说我不能喝 on 2019-12-05 18:28:42

The PDF format is designed to let a document be displayed and printed correctly, not to give structured access to its text content. A page's content stream simply places runs of glyphs at coordinates, in whatever order the producing application wrote them. Extracting text from a PDF document is therefore similar to running the printed page through OCR software: you may not have to recognize the glyphs and convert them to characters, but the structure and logical text flow of the document still have to be estimated.

If you go beyond the naive text extraction examples, both iText and PDFBox (if I remember correctly) give you much more detailed access to the document elements. In that case you need both the text content and its position on the page to reconstruct the content in a meaningful way.
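To illustrate what reconstructing from positions can look like, here is a minimal sketch in plain Java. It assumes you have already collected text chunks with coordinates, for example by overriding `writeString` in a PDFBox `PDFTextStripper` subclass and reading `getXDirAdj()` / `getYDirAdj()` from the `TextPosition` objects, or via iText's `LocationTextExtractionStrategy`. The `Fragment` and `ReadingOrderSketch` classes and the `lineTolerance` parameter are hypothetical names introduced for this example; they are not part of either library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for one text chunk and its position (not a PDFBox or iText type). */
final class Fragment {
    final String text;
    final float x; // left edge of the chunk
    final float y; // baseline, assumed to be measured from the top of the page

    Fragment(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrderSketch {
    private ReadingOrderSketch() {}

    /**
     * Groups fragments into lines by baseline, then orders each line left to right.
     * lineTolerance is the largest distance from a line's first baseline
     * that is still treated as belonging to that line.
     */
    static List<String> toLines(List<Fragment> fragments, float lineTolerance) {
        // Sort top to bottom first; the input order is whatever the PDF producer emitted.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // Bucket fragments whose baselines are close to the first baseline of the row.
        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > lineTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        // Within each row, order left to right and join with single spaces.
        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Calling `ReadingOrderSketch.toLines(fragments, 2f)` on chunks that arrive out of order returns one string per visual line. This is only the simplest heuristic: it does not handle multi-column layouts, tables, rotated text, or varying font sizes, which is exactly the estimation work the PDF format leaves to the extractor.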
