PdfBox text extraction not working properly

Posted by 最后都变了 on 2020-01-05 12:34:07

Question


PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);

Extracted text: http://pastebin.com/BXFfMy0z

Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf

What can I do to extract correct text from this pdf file?


Answer 1:


In addition to @karthik27's answer:

Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.

Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.

In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just as you got with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
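As a rough programmatic stand-in for the copy-and-paste test, one could measure how much of the extracted string consists of invisible, private-use, or replacement characters. The sketch below is only a heuristic under stated assumptions: the class name ExtractionCheck, its methods, and the 10% threshold are all made up for illustration and are not part of PDFBox.

```java
final class ExtractionCheck {
    // Hypothetical cut-off: above this share of suspicious characters the text is treated as garbage.
    static final double GARBAGE_THRESHOLD = 0.10;

    /** Fraction of code points that are invisible, private-use, unassigned, or U+FFFD. */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary layout whitespace
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.FORMAT
                || type == Character.PRIVATE_USE
                || type == Character.SURROGATE
                || type == Character.UNASSIGNED;
    }

    static boolean looksLikeGarbage(String text) {
        return suspiciousRatio(text) > GARBAGE_THRESHOLD;
    }

    public static void main(String[] args) {
        // Plain readable text: no suspicious characters.
        System.out.println(looksLikeGarbage("Hello, world")); // false
        // Control and private-use characters dominate: 5 of 6 are suspicious.
        System.out.println(looksLikeGarbage("\u0001\u0002\uE000\uE001 \uFFFD")); // true
    }
}
```

In practice the string returned by stripper.getText(document) would be passed to looksLikeGarbage. A high ratio suggests the PDF lacks usable text information and that OCR may be the only option, though a low ratio does not prove the text is correct.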




Answer 2:


I think the problem is the encoding: the text in the PDF is encoded in a different format. If you right-click on the document and choose Document Properties, you can find the encoding. I think the links below explain this in more detail.

link1
link2




Answer 3:


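For text extraction to work, the original file should contain a mapping from its character codes to Unicode (in PDF terms, typically a ToUnicode CMap on each font). That mapping is absent here, so you get broken text after extraction.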
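To illustrate why the missing mapping matters, here is a toy model in plain Java. It is not PDFBox's API or its actual decoding logic: the class ToUnicodeSketch, the glyph codes, and the fallback rule are all assumptions chosen for the example. A subset font may number its glyphs arbitrarily, so without a code-to-Unicode table an extractor can only guess.

```java
import java.util.HashMap;
import java.util.Map;

final class ToUnicodeSketch {
    /**
     * Decodes glyph codes with the given table. Codes without an entry fall back
     * to being treated as Unicode code points, which is only a guess.
     */
    static String decode(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : glyphCodes) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(code);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset font: glyphs numbered in order of first use.
        int[] glyphCodes = {1, 2, 3, 3, 4};

        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "H");
        toUnicode.put(2, "e");
        toUnicode.put(3, "l");
        toUnicode.put(4, "o");

        // With the mapping, the text is recovered.
        System.out.println(decode(glyphCodes, toUnicode)); // Hello

        // Without it, the result is a run of invisible control characters.
        String broken = decode(glyphCodes, new HashMap<>());
        System.out.println(broken.equals("\u0001\u0002\u0003\u0003\u0004")); // true
    }
}
```

The second call mirrors the situation described above: the glyphs still render correctly on screen because the font knows their shapes, but the extracted string is meaningless.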



Source: https://stackoverflow.com/questions/20068096/pdfbox-text-extraction-not-working-properly
