Text extraction is empty or returns unknown characters for text in a Type 3 font using PDFBox or iText (difficult topic!)

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-02 12:48:26

A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince" (TAFKAP), which is a glyph but not a letter from any known alphabet.

A glyph in a Type 3 font is a series of lines and shapes. Unless the font also carries mapping information (such as a ToUnicode map or meaningful glyph names), there's no way for a program such as iText or PDFBox to determine which character was meant, so it is only normal that you get a question mark. For instance: which character would you use for the TAFKAP symbol mentioned above?

A PDF usually contains Type 3 fonts for one of the following reasons:

  1. The font was used to introduce symbols that don't exist in any font.
  2. The font was used to obfuscate the content of the PDF so that its content can't be extracted.
  3. The PDF wasn't created in an elegant way.
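
Whichever reason applies, it helps to confirm first that the document really uses a Type 3 font. The sketch below is only a rough heuristic, not how iText or PDFBox inspect fonts: it scans the raw file bytes for a `/Subtype /Type3` entry. The class and method names are hypothetical, and the scan will miss fonts whose dictionaries are stored inside compressed object streams, so a negative result proves nothing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of iText or PDFBox.
class Type3FontScan {
    // A font dictionary declares its kind as "/Subtype /Type3";
    // whitespace between the two names is optional in PDF syntax.
    private static final Pattern TYPE3 = Pattern.compile("/Subtype\\s*/Type3\\b");

    // Returns true if the raw bytes contain an uncompressed Type 3 font declaration.
    static boolean mentionsType3(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is preserved.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return TYPE3.matcher(raw).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Type3FontScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(mentionsType3(bytes)
                ? "Type 3 font declaration found"
                : "No uncompressed Type 3 font declaration found");
    }
}
```

A more reliable check would walk the font resources of each page with a PDF library; the byte scan is only a quick first look.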

If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
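
To decide when to fall back to OCR, one option is to check whether the text that came out of the extractor is empty or mostly placeholders. The following is a minimal sketch under assumptions: the class name, method names, and threshold are hypothetical, and it treats both `?` and the Unicode replacement character U+FFFD as "unknown", which can over-count in text that legitimately contains many question marks.

```java
// Hypothetical post-extraction check; works on the String returned by any extractor.
class ExtractionCheck {
    // Fraction of non-whitespace characters that are '?' or U+FFFD.
    // Empty or whitespace-only text counts as fully unknown (1.0).
    static double unknownRatio(String text) {
        int visible = 0;
        int unknown = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout characters say nothing about extraction quality
            }
            visible++;
            if (cp == '?' || cp == 0xFFFD) {
                unknown++;
            }
        }
        return visible == 0 ? 1.0 : (double) unknown / visible;
    }

    // True when the share of unknown characters reaches the given threshold.
    static boolean looksUnusable(String text, double threshold) {
        return unknownRatio(text) >= threshold;
    }
}
```

For example, `ExtractionCheck.looksUnusable(extracted, 0.5)` would flag a page where at least half of the visible characters are placeholders; the right threshold depends on the documents involved.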
