Error when extracting text from PDF using PDFBox

百般思念 submitted on 2021-01-29 02:50:38

Question


sample pdf

The sample PDF is a three-page Chinese resume. I extract the text with the standard code below:

// PDFBox 2.x; try-with-resources closes the document when done
try (PDDocument document = PDDocument.load(new File(path))) {
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(document);
}

The extraction result contains only a few of the words in the document.


Answer 1:


If you run the text extraction code and enable logging, you'll see numerous warnings:

Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+7566 (7566) in font GNPVNR+PingFangSC-Semibold
Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARN: No Unicode mapping for CID+1915 (1915) in font GNPVNR+PingFangSC-Semibold
...
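To get a feel for how many glyphs are affected per font, the warnings can be tallied. The following is only a sketch: the class and method names (`UnmappedCidTally`, `countByFont`) are made up for illustration, and the regular expression assumes the log lines look exactly like the ones shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts "No Unicode mapping" warnings per font name. */
final class UnmappedCidTally {

    // Assumes the warning text format shown in the log excerpt above.
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    static Map<String, Integer> countByFont(Iterable<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                // group(2) is the font name, e.g. GNPVNR+PingFangSC-Semibold
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+5482 (5482) in font GNPVNR+PingFangSC-Semibold",
                "Feb 12, 2019 5:45:58 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode",
                "WARN: No Unicode mapping for CID+1842 (1842) in font GNPVNR+PingFangSC-Semibold");
        countByFont(sample).forEach((font, n) -> System.out.println(font + ": " + n));
    }
}
```

Lines that do not match the pattern, such as the timestamp lines, are simply ignored.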

Indeed, inspecting the PDF shows that numerous subsets of PingFangSC font styles are embedded, but each one comes

  • with a ToUnicode map without any entries at all,
  • with an Identity-H encoding, and
  • with an Adobe-Identity-0 ROS,

i.e. without any information about which glyph represents which Unicode code point. It is therefore no surprise that the text extraction results are so poor.
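Why this combination leaves an extractor with nothing to work with can be shown with a toy model. This is not PDFBox code, only a simplified sketch: the class `ToUnicodeSketch`, its `extract` method, and the "A"/"B" table entries are invented for illustration. It assumes that Identity-H turns each 2-byte code directly into the CID with the same value, and that a CID without a ToUnicode entry contributes no text.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of CID-to-text lookup; this is not PDFBox code. */
final class ToUnicodeSketch {

    /**
     * Reads the bytes as Identity-H codes: every two bytes form one big-endian
     * code, and that code is used unchanged as the CID. Each CID is then looked
     * up in the ToUnicode table; CIDs without an entry contribute no text.
     */
    static String extract(byte[] shownBytes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 1 < shownBytes.length; i += 2) {
            int cid = ((shownBytes[i] & 0xFF) << 8) | (shownBytes[i + 1] & 0xFF);
            String mapped = toUnicode.get(cid);
            if (mapped != null) {
                text.append(mapped);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // CIDs 5482 (0x156A) and 1842 (0x0732), taken from the warnings above.
        byte[] shownBytes = {0x15, 0x6A, 0x07, 0x32};

        // Empty table, as in the PDF discussed here: no text can be recovered.
        Map<Integer, String> empty = new HashMap<>();
        System.out.println("[" + extract(shownBytes, empty) + "]");

        // Invented entries, for illustration only; the real characters behind
        // these CIDs cannot be determined from the PDF.
        Map<Integer, String> populated = new HashMap<>();
        populated.put(5482, "A");
        populated.put(1842, "B");
        System.out.println("[" + extract(shownBytes, populated) + "]");
    }
}
```

With the empty table the two glyphs yield no text at all, while the same bytes yield two characters once the table has entries. The glyphs still render correctly on screen in both cases, because rendering only needs the CID, not the Unicode value.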

So if you really need to extract the text, ask the source of the PDF to provide a copy which includes the required information. If that is not possible, try OCR.


By the way, a good first check is usually to try to copy and paste the text from Adobe Reader. In the case at hand, that also results in mostly missing characters, which usually means that the information the PDF specification requires for text extraction is missing.

You'll also find more background at the link @Tilman provided in a comment: https://pdfbox.apache.org/2.0/faq.html#text-extraction



Source: https://stackoverflow.com/questions/54644435/error-when-extracting-text-from-pdf-using-pdfbox
