PdfBox text extraction not working properly

Question

PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);

Extracted text: http://pastebin.com/BXFfMy0z

Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf

What can I do to extract correct text from this pdf file?

Answer 1:

In addition to @karthik27's answer:

Adobe Reader is fairly good at text extraction and can therefore generally be used as an indicator of whether text extraction from a given document is possible at all.

Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.

In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just like what you got with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
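
If you want to detect this kind of garbage programmatically instead of eyeballing it, a rough heuristic is to count how many characters in the extracted string are control, private-use, or unassigned code points. The sketch below uses only the JDK; the class name `ExtractedTextCheck`, its methods, and the threshold are my own assumptions for illustration, not part of PDFBox, and a low ratio does not prove the text is correct.

```java
final class ExtractedTextCheck {

    /**
     * Returns the fraction of code points that are unlikely to appear in
     * readable text: control characters (other than whitespace), private-use
     * and unassigned code points, lone surrogates, and U+FFFD.
     */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isWhitespace(cp)) {
                continue; // line breaks, tabs and spaces are normal
            }
            int type = Character.getType(cp);
            if (type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || cp == 0xFFFD) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Hypothetical helper: flags text whose suspicious ratio exceeds the threshold. */
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument to try it out.
        String text = args.length > 0 ? args[0] : "";
        System.out.println("suspicious ratio: " + suspiciousRatio(text));
    }
}
```

You would call `ExtractedTextCheck.looksLikeGarbage(text, 0.3)` on the string returned by `stripper.getText(document)`; the 0.3 is an arbitrary starting point that you would need to tune on your own documents.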

Answer 2:

I think the problem is the encoding: the text in the PDF is encoded in a different format. If you right-click on the document and open the document properties, you can find the encoding. I think the links below give more explanation:

link1
link2

Answer 3:

The original file should contain a mapping from the font's character codes to Unicode. That mapping is absent, which is why the text comes out broken after extraction.
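
To see why a missing mapping produces garbage, here is a toy model in plain Java. It is not PDFBox code and real extractors use more elaborate fallbacks; the class `ToUnicodeToy`, the glyph codes, and the map contents are all made up for illustration. The idea is that the page content only stores font-specific codes, and the extractor needs a code-to-Unicode table (what PDF calls a ToUnicode CMap) to turn them back into readable characters.

```java
import java.util.HashMap;
import java.util.Map;

final class ToUnicodeToy {

    /**
     * Decodes font-specific glyph codes into a string.
     * If a code has an entry in toUnicode, the mapped text is used.
     * Otherwise (or if toUnicode is null) the raw code is emitted as if it
     * were a Unicode code point, which is where the garbage comes from.
     */
    static String decode(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            String mapped = (toUnicode == null) ? null : toUnicode.get(code);
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.appendCodePoint(code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset font: codes 1..4 stand for the glyphs H, e, l, o.
        int[] codes = {1, 2, 3, 3, 4};

        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "H");
        toUnicode.put(2, "e");
        toUnicode.put(3, "l");
        toUnicode.put(4, "o");

        System.out.println("with mapping:    " + decode(codes, toUnicode));
        // Without the table the result is a run of control characters.
        System.out.println("without mapping: " + decode(codes, null).length() + " unreadable chars");
    }
}
```

The glyphs still render correctly on screen in both cases, because drawing only needs the codes and the embedded font. Only extraction depends on the table, which is why the PDF can look fine and still yield broken text.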

Source: https://stackoverflow.com/questions/20068096/pdfbox-text-extraction-not-working-properly

Tags

java

pdf

pdfbox