Question
The sample PDF is a Chinese resume of 3 pages. I extract its text with the standard code below:
PDDocument document = PDDocument.load(new File(path));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close();
The extraction result (shown as an image in the original post) contains only some of the words.
Answer 1:
If you run the text extraction code and enable logging, you'll see numerous warnings:
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+7566 (7566) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1915 (1915) in font GNPVNR+PingFangSC-Semibold
...
Indeed, inspecting the PDF shows that numerous subsets of PingFangSC styles are embedded, but each time
- with a ToUnicode map without any entries at all,
- with an Identity-H encoding, and
- with an Adobe-Identity-0 ROS,
i.e. without any information about which glyph represents which Unicode code point. It is therefore no surprise that text extraction yields very incomplete results.
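To make the first of these conditions concrete, here is a minimal sketch that counts the mappings a ToUnicode CMap declares. It assumes the CMap stream has already been decoded to plain text (in a PDF it is usually compressed), for example with a PDF inspection tool. The class name `ToUnicodeEntryCounter`, the method `countDeclaredMappings`, and both sample CMaps are hypothetical illustrations and are not taken from the PDF in question. The code only sums the counts announced in `beginbfchar` and `beginbfrange` section headers; it does not validate the entries themselves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToUnicodeEntryCounter {

    // Matches section headers such as "2 beginbfchar" or "1 beginbfrange".
    private static final Pattern SECTION_HEADER =
            Pattern.compile("(\\d+)\\s+beginbf(?:char|range)\\b");

    /** Sums the entry counts declared by all bfchar and bfrange sections. */
    static int countDeclaredMappings(String cmapText) {
        Matcher m = SECTION_HEADER.matcher(cmapText);
        int total = 0;
        while (m.find()) {
            total += Integer.parseInt(m.group(1));
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical CMap with a code space range but no mapping sections.
        String empty = String.join("\n",
                "/CIDInit /ProcSet findresource begin",
                "12 dict begin",
                "begincmap",
                "1 begincodespacerange",
                "<0000> <FFFF>",
                "endcodespacerange",
                "endcmap",
                "end",
                "end");

        // Hypothetical CMap that maps two single codes and one code range.
        String populated = String.join("\n",
                "begincmap",
                "1 begincodespacerange",
                "<0000> <FFFF>",
                "endcodespacerange",
                "2 beginbfchar",
                "<0003> <0020>",
                "<1569> <4E2D>",
                "endbfchar",
                "1 beginbfrange",
                "<0024> <0026> <0041>",
                "endbfrange",
                "endcmap");

        System.out.println(countDeclaredMappings(empty));     // prints 0
        System.out.println(countDeclaredMappings(populated)); // prints 3
    }
}
```

A result of 0 corresponds to the situation described above: the map exists but gives a text extractor nothing to translate character codes with.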
So if you really need to extract the text, ask the source of the PDF to provide a copy which includes the required information. If that is not possible, try OCR.
By the way, a good first check is usually to try to copy and paste the text from Adobe Reader. In the case at hand, that also results in mostly missing characters, which usually means that the information required for text extraction according to the PDF specification is missing.
You'll also find more background at the link @Tilman provided in a comment: https://pdfbox.apache.org/2.0/faq.html#text-extraction
Source: https://stackoverflow.com/questions/54644435/error-when-extracting-text-from-pdf-using-pdfbox