/Differences dictionary for encode parsing issue in PDF

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-10 18:41:42

Question


A Type 1 font's /Differences encoding uses strings in its mapping of values; for example, the character 1 is encoded as 'one'. It is used for numbers and special characters only.

What is the standard way to use these encodings?

How should I decode a string from a PDF which uses such an encoding?

Link for the file: http://www.filedropper.com/open


Answer 1:


Here's the /Differences array in your file (and honestly, you should have just posted this and not a link to a skeevy download page):

/Differences [
    24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
    39 /quotesingle
    96 /grave
    128 /bullet/dagger/daggerdbl/ellipsis...
]

The way this works is that the font also has an encoding associated with it (for example /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, there is an encoding built into the font. Given a copy of that encoding, you apply the differences to it: starting from each number (your first is 24), you replace consecutive entries with the names that follow it, so entries 24-31 inclusive become /breve, /caron, /circumflex and so on.
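As a rough illustration of that procedure, here is a minimal Python sketch. It assumes the /Differences array has already been parsed into a list of integers and glyph names (stored without the leading '/'), and it uses an all-.notdef base encoding as a stand-in for the font's real one; the function name apply_differences is made up for this example.

```python
from typing import List, Union

NOTDEF = ".notdef"


def apply_differences(base: List[str],
                      differences: List[Union[int, str]]) -> List[str]:
    """Return a copy of a 256-entry encoding with a /Differences array applied."""
    encoding = list(base)  # work on a copy, leave the base encoding untouched
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number sets the code for the next name
        else:
            if code is None:
                raise ValueError("a /Differences array must start with a code")
            if not 0 <= code <= 255:
                raise ValueError(f"code {code} is outside 0..255")
            encoding[code] = item
            code += 1  # following names take consecutive codes
    return encoding


# Placeholder base encoding (assumption): a real reader would start from the
# font's built-in encoding or a predefined one such as WinAnsiEncoding.
base = [NOTDEF] * 256

differences = [
    24, "breve", "caron", "circumflex", "dotaccent",
    "hungarumlaut", "ogonek", "ring", "tilde",
    39, "quotesingle",
    96, "grave",
]

encoding = apply_differences(base, differences)
print(encoding[24], encoding[26], encoding[31], encoding[39], encoding[96])
```

Entries that the array does not mention keep whatever the base encoding had.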

In Type 1 fonts, there is a dictionary called /CharStrings, which is an association of a glyph name with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. You then look that up in the CharStrings dictionary, pull out the glyph data and render it. Any character that does not exist in the encoding should be set to /.notdef, which will then render a shape representing an undefined character (usually an empty box).
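The lookup chain (code, then glyph name, then CharStrings entry) might look roughly like this in Python. The charstrings dictionary and its byte values are toy placeholders rather than real Type 1 outline data, and glyph_for_code is a hypothetical helper name.

```python
NOTDEF = ".notdef"


def glyph_for_code(code, encoding, charstrings):
    """Map a character code to (glyph name, glyph data), falling back to .notdef."""
    name = encoding[code] if 0 <= code < len(encoding) else NOTDEF
    if name not in charstrings:
        name = NOTDEF  # the font has no glyph with that name
    return name, charstrings[name]


# Toy stand-ins (assumptions): real CharStrings values are encrypted outlines.
charstrings = {
    ".notdef": b"<empty box outline>",
    "circumflex": b"<circumflex outline>",
}

encoding = [NOTDEF] * 256
encoding[26] = "circumflex"
encoding[27] = "dotaccent"  # named in the encoding but missing from the font

print(glyph_for_code(26, encoding, charstrings))
print(glyph_for_code(27, encoding, charstrings))
print(glyph_for_code(65, encoding, charstrings))
```

The sketch relies on the font containing a .notdef glyph, which Type 1 fonts are expected to provide.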

Now, likely your problem is: how do I turn these glyph names into something more useful like, say, Unicode?

If you look in Annex D of the PDF specification, you'll see a set of tables that define the character sets for the standard Latin encodings. You would make a lookup table that maps Adobe standard names to Unicode. Unfortunately, the tables in Annex D are incomplete. Fortunately, Adobe has a file that defines all of that for you: the Adobe Glyph List. There is a link in that file which is now dead, but most likely it was meant to go here.
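A lookup table of that kind could be built along these lines. The sketch assumes the semicolon-separated layout used by the Adobe Glyph List (a glyph name, a semicolon, then one or more hexadecimal code points, with '#' starting a comment line) and parses a short inline excerpt instead of the real file. The helper names and the use of U+FFFD for unknown names are choices made for this example.

```python
def parse_glyph_list(text):
    """Parse 'name;XXXX [XXXX ...]' lines into a dict of name -> Unicode string."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, hex_codes = line.partition(";")
        table[name] = "".join(chr(int(h, 16)) for h in hex_codes.split())
    return table


# Short excerpt in the glyph list layout; the real file has thousands of entries.
SAMPLE = """\
# name;code point(s) in hexadecimal
one;0031
quotesingle;0027
breve;02D8
caron;02C7
circumflex;02C6
fi;FB01
"""

glyph_to_unicode = parse_glyph_list(SAMPLE)


def names_to_text(names, table):
    """Join the Unicode values of glyph names, using U+FFFD for unknown names."""
    return "".join(table.get(name, "\ufffd") for name in names)


print(names_to_text(["one", "quotesingle", "fi"], glyph_to_unicode))
```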




Answer 2:


How should I decode a string from a PDF which uses such an encoding?

As the specification explains:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods:

  • If the font dictionary contains a ToUnicode CMap, use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font:

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List to obtain the corresponding Unicode value.

  • If the font is a composite font ... (not applicable in your case)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(ISO 32000-1)

First of all, therefore, you should look for a ToUnicode map.

If there is none (as in the case of your sample document), use the Encoding (predefined or differences).

And if your code is not mapped to something proper in the encoding, then according to the spec there is no way to determine what the character code represents!

If the font in question is embedded, you might yet have a way out by parsing the embedded font program which may include its own mapping to Unicode.

Otherwise, though, this is where you can start guessing (or delegate to OCR).
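The priority order described above could be sketched in Python as follows. The ToUnicode CMap is modelled as a plain dict from code to string, the encoding as a 256-entry list of glyph names with /Differences already applied, and the glyph list as a dict from name to Unicode; all three representations, and the function names, are assumptions of this example. Falling through to the encoding when a ToUnicode map exists but lacks the code is also a choice made here, not something the quoted text prescribes.

```python
from typing import Dict, List, Optional


def code_to_unicode(code: int,
                    to_unicode: Optional[Dict[int, str]],
                    encoding: Optional[List[str]],
                    glyph_list: Dict[str, str]) -> Optional[str]:
    """Try ToUnicode first, then encoding plus glyph list; None means unknown."""
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    if encoding is not None and 0 <= code < len(encoding):
        name = encoding[code]
        if name in glyph_list:
            return glyph_list[name]
    return None  # nothing left but guessing or OCR


def decode_bytes(data: bytes, to_unicode, encoding, glyph_list):
    """Decode a simple-font string, one byte per character code."""
    return [code_to_unicode(b, to_unicode, encoding, glyph_list) for b in data]


# Toy data (assumptions), loosely following the F25/F26 layout quoted below.
glyph_list = {"one": "1", "A": "A", "fi": "\ufb01"}
encoding = [".notdef"] * 256
encoding[2] = "fi"
encoding[49] = "one"
encoding[65] = "A"

print(decode_bytes(b"\x02A1\x07", None, encoding, glyph_list))
print(decode_bytes(b"\x02A1\x07", {2: "fi"}, encoding, glyph_list))
```

Codes that come back as None are the ones for which the embedded font program, guessing, or OCR would have to take over.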


But your assumption

It is used for numbers and special characters only.

is already wrong. If you look at your sample document, the two fonts F25 and F26 used on its first page, for example, have a Differences array like this:

0 /.notdef 1 /dotaccent /fi /fl /fraction /hungarumlaut /Lslash /lslash /ogonek /ring 10 /.notdef 11 /breve /minus 13 /.notdef 14 /Zcaron /zcaron /caron /dotlessi /dotlessj /ff /ffi /ffl 22 /.notdef 30 /grave /quotesingle /space /exclam /quotedbl /numbersign /dollar /percent /ampersand /quoteright /parenleft /parenright /asterisk /plus /comma /hyphen /period /slash /zero /one /two /three /four /five /six /seven /eight /nine /colon /semicolon /less /equal /greater /question /at /A /B /C /D /E /F /G /H /I /J /K /L /M /N /O /P /Q /R /S /T /U /V /W /X /Y /Z /bracketleft /backslash /bracketright /asciicircum /underscore /quoteleft /a /b /c /d /e /f /g /h /i /j /k /l /m /n /o /p /q /r /s /t /u /v /w /x /y /z /braceleft /bar /braceright /asciitilde 127 /.notdef 130 /quotesinglbase /florin /quotedblbase /ellipsis /dagger /daggerdbl /circumflex /perthousand /Scaron /guilsinglleft /OE 141 /.notdef 147 /quotedblleft /quotedblright /bullet /endash /emdash /tilde /trademark /scaron /guilsinglright /oe 157 /.notdef 159 /Ydieresis 160 /.notdef 161 /exclamdown /cent /sterling /currency /yen /brokenbar /section /dieresis /copyright /ordfeminine /guillemotleft /logicalnot /hyphen /registered /macron /degree /plusminus /twosuperior /threesuperior /acute /mu /paragraph /periodcentered /cedilla /onesuperior /ordmasculine /guillemotright /onequarter /onehalf /threequarters /questiondown /Agrave /Aacute /Acircumflex /Atilde /Adieresis /Aring /AE /Ccedilla /Egrave /Eacute /Ecircumflex /Edieresis /Igrave /Iacute /Icircumflex /Idieresis /Eth /Ntilde /Ograve /Oacute /Ocircumflex /Otilde /Odieresis /multiply /Oslash /Ugrave /Uacute /Ucircumflex /Udieresis /Yacute /Thorn /germandbls /agrave /aacute /acircumflex /atilde /adieresis /aring /ae /ccedilla /egrave /eacute /ecircumflex /edieresis /igrave /iacute /icircumflex /idieresis /eth /ntilde /ograve /oacute /ocircumflex /otilde /odieresis /divide /oslash /ugrave /uacute /ucircumflex /udieresis /yacute /thorn /ydieresis

which contains mappings for normal uppercase /A../Z and lowercase /a../z characters, too.


By the way,

A Type 1 font's /Differences encoding uses strings in its mapping of values; for example, the character 1 is encoded as 'one'.

is not strictly correct: the '/' characters are part of the respective mapped values, e.g. /one, and as PDF objects these are not Strings but Names.
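To make the distinction concrete, here is a small Python sketch that tokenizes the raw text of a Differences array into integers and Name objects. It is not a full PDF parser: it ignores #xx escapes inside names, comments, and indirect references, and parse_differences is a hypothetical helper.

```python
import re

# A Name is '/' followed by characters that are neither whitespace nor PDF
# delimiters; anything else of interest here is an integer code.
TOKEN = re.compile(r"/([^\s()<>\[\]{}/%]*)|(-?\d+)")


def parse_differences(source: str):
    """Split raw Differences text into ints (codes) and strs (names, no '/')."""
    items = []
    for match in TOKEN.finditer(source):
        if match.group(1) is not None:
            items.append(match.group(1))  # a Name object such as /one
        else:
            items.append(int(match.group(2)))  # a starting character code
    return items


raw = "[ 24 /breve/caron/circumflex 39 /quotesingle 48 /zero /one ]"
print(parse_differences(raw))
```

The digit 1 never appears as a string in the array; the code 49 is simply associated with the Name /one.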



Source: https://stackoverflow.com/questions/30300827/differences-dictionary-for-encode-parsing-issue-in-pdf
