Error when extracting text from PDF using PDFBox

Question

sample pdf

The sample PDF is a three-page Chinese resume. I extract its text with the standard code below:

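// Needs java.io.File, org.apache.pdfbox.pdmodel.PDDocument and org.apache.pdfbox.text.PDFTextStripper
try (PDDocument document = PDDocument.load(new File(path))) { // closes the document when done
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(document);
}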

The extraction result, shown in the image in the original post, contains only a few of the words.

Answer 1:

If you run the text extraction code and enable logging, you'll see numerous warnings:

Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+7566 (7566) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1915 (1915) in font GNPVNR+PingFangSC-Semibold
...
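
To get a quick overview of how widespread the problem is, the warnings can be tallied per font. The following is only a minimal sketch using plain Java: it assumes the log lines have the exact shape shown above, and the class and method names (`UnmappedCidSummary`, `countByFont`) are made up for this illustration and are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
class UnmappedCidSummary {
    // Matches lines such as:
    // WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    /** Counts matching warning lines per font name; all other lines are ignored. */
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold",
                "WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold");
        System.out.println(countByFont(sample));
    }
}
```

If nearly every glyph of every embedded font shows up in such a tally, the missing mappings are a property of the PDF itself rather than an occasional glitch.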

Indeed, inspecting the PDF shows that numerous subsets of PingFangSC styles are embedded, but each of them comes

with a ToUnicode map without any entries at all,
with an Identity-H encoding, and
with an Adobe-Identity-0 ROS,

i.e. without any information about which glyph represents which Unicode code point. It is therefore no surprise that the text extraction results are so poor.
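
Why an empty ToUnicode map is fatal can be illustrated with a toy model. This is not PDFBox code, merely a sketch of the idea: the page content only stores glyph codes (CIDs), and an extractor needs a table to turn them into Unicode. The class name `ToUnicodeToyModel` and the two CID-to-character pairs are invented for this illustration; the real characters behind those CIDs are unknown.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model of text extraction from a composite font; hypothetical, not a PDFBox class.
class ToUnicodeToyModel {
    /**
     * Translates glyph codes to text using a CID-to-Unicode table.
     * Codes without an entry are skipped, since nothing is known about them.
     */
    static String decode(int[] cids, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            String unicode = toUnicode.get(cid);
            if (unicode != null) {
                sb.append(unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] cids = {5482, 1842};

        // Invented mappings, for illustration only.
        Map<Integer, String> populated = new HashMap<>();
        populated.put(5482, "\u7B80");
        populated.put(1842, "\u5386");

        System.out.println("populated map: '" + decode(cids, populated) + "'");
        // An empty table, as in the PDF at hand, leaves nothing to extract.
        System.out.println("empty map:     '" + decode(cids, Collections.<Integer, String>emptyMap()) + "'");
    }
}
```

With Identity-H and an Adobe-Identity-0 ROS there is also no predefined character collection to fall back on, so the empty table in the model corresponds to the situation in this PDF.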

So if you really need to extract the text, ask the source of the PDF to provide a copy which includes the required information. If that is not possible, try OCR.

By the way, a good first check is usually to copy and paste the text from Adobe Reader. In the case at hand, that also yields mostly missing characters, which usually means that the information the PDF specification requires for text extraction is missing from the file.

You'll also find more background at the link @Tilman provided in a comment: https://pdfbox.apache.org/2.0/faq.html#text-extraction

Source: https://stackoverflow.com/questions/54644435/error-when-extracting-text-from-pdf-using-pdfbox

Tags

java

pdfbox