PHP Filter FlateDecode PDF stream returning offset characters

Posted by 会有一股神秘感。 on 2019-12-13 07:52:03

Question


I have code that extracts text from a PDF using a filetotext class. It worked until last week, when something changed in the PDFs being generated. The weird thing is that the characters appear to be present and correct once I add 29 to the ord of each character.

Example response debug printout:

/F1 7.31 Tf
0 0 0 rg
1 0 0 1 195.16 597.4 Tm
($PRXQW)Tj
ET
BT

The code uses gzuncompress on the stream section of the PDF. The string $PRXQW should read "Amount", and adding 29 (decimal) to the ord of each character produces exactly that. But some characters do not follow this translation directly: for example, what should be a ) in the text appears as the two bytes 5C 66.
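The observation can be reproduced with a short sketch (Python is used here purely for illustration; the original code is PHP). It assumes that the bytes 5C 66 are the PDF literal-string escape \f (form feed, 0x0C), which would account for the ) case, since 0x0C + 29 = 0x29. The helper names unescape_pdf_literal and shift_codes are hypothetical, and only the single-character backslash escapes are handled.

```python
# Sketch only: reproduces the "+29" observation from the debug printout.
# Handles the single-character backslash escapes of PDF literal strings;
# octal escapes (\ddd) and line continuations are deliberately left out.

SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}

def unescape_pdf_literal(raw):
    """Turn the bytes between ( and ) into the actual character codes."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == 0x5C and i + 1 < len(raw):
            # Known escape -> its byte value; otherwise drop the backslash.
            out.append(SIMPLE_ESCAPES.get(raw[i + 1], raw[i + 1]))
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return bytes(out)

def shift_codes(codes, offset=29):
    """Apply the fixed offset observed in this particular PDF."""
    return "".join(chr(c + offset) for c in codes)

print(shift_codes(unescape_pdf_literal(b"$PRXQW")))  # Amount
print(shift_codes(unescape_pdf_literal(b"\\f")))     # )
```

A fixed offset like this is specific to one font in one file, so it should be read as a diagnostic aid rather than a decoding method.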

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?


Answer 1:


The encoding of the string argument of the Tj operator depends entirely on the PDF font in use (F1 in the case at hand):

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".

(section 9.4.3 "Text-Showing Operators" in ISO 32000-1)

The OP's code seems to assume a standard encoding like MacRomanEncoding or WinAnsiEncoding, but these are merely special cases. As the quote above indicates, the encoding might just as well be some ad-hoc mixed multibyte encoding.
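To make the difference concrete, here is a minimal sketch (in Python, for illustration) of how the same string bytes yield different character codes depending on the font type. It assumes a composite font whose CMap uses fixed two-byte codes, as Identity-H does; CMaps with mixed code lengths need their codespace ranges and are not covered. The function name character_codes is made up for this example.

```python
def character_codes(string_bytes, composite):
    """Split the operand of a text-showing operator into character codes.

    Simple font: one code per byte.
    Composite font: assumed here to use fixed 2-byte big-endian codes
    (true for Identity-H, not for every CMap).
    """
    if not composite:
        return list(string_bytes)
    if len(string_bytes) % 2:
        raise ValueError("odd number of bytes for a 2-byte encoding")
    return [
        int.from_bytes(string_bytes[i:i + 2], "big")
        for i in range(0, len(string_bytes), 2)
    ]

data = bytes.fromhex("00240050")
print(character_codes(data, composite=False))  # [0, 36, 0, 80]
print(character_codes(data, composite=True))   # [36, 80]
```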

The PDF specification in a later section describes how to properly extract text:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(section 9.10.2 "Mapping Character Codes to Unicode Values" in ISO 32000-1)
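The priority order in that quote can be sketched roughly as follows (Python, for illustration only). Several things here are assumptions: the to_unicode argument is a hypothetical, already-parsed dictionary of character code to string; Python's cp1252 and mac_roman codecs stand in for WinAnsiEncoding and MacRomanEncoding, which they approximate but do not reproduce byte for byte; Differences arrays and the third method (predefined CMaps and CID collections) are left out.

```python
# Python codecs used as stand-ins for the PDF predefined encodings.
# They agree with the PDF tables for ordinary Latin text but are not
# guaranteed to match Annex D exactly, so treat this as an approximation.
APPROXIMATE_CODECS = {
    "WinAnsiEncoding": "cp1252",
    "MacRomanEncoding": "mac_roman",
}

def code_to_unicode(code, to_unicode=None, encoding_name=None):
    """Return the Unicode string for a character code, or None.

    to_unicode: dict of character code -> str, assumed to have been
    parsed from the font's ToUnicode CMap beforehand.
    """
    # Method 1: a ToUnicode CMap takes priority.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # Method 2 (approximated): simple font with a predefined encoding.
    codec = APPROXIMATE_CODECS.get(encoding_name)
    if codec is not None and 0 <= code <= 0xFF:
        try:
            return bytes([code]).decode(codec)
        except UnicodeDecodeError:
            return None
    # Method 3 (predefined CMaps / CID collections) is not sketched here.
    return None

print(code_to_unicode(0x24, to_unicode={0x24: "A"}))          # A
print(code_to_unicode(0x41, encoding_name="WinAnsiEncoding"))  # A
print(code_to_unicode(0x0003))                                 # None
```

Returning None for the last case mirrors the situation described in the final paragraph of the quote: without a usable mapping, the code itself says nothing about the character.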

Thus:

Just wondering about this code ring type of character coming out of PDF's now and if anyone has seen this kind of thing?

Yes, it is fairly common for PDFs in the wild to have text-drawing operator string arguments in an encoding entirely different from anything ASCII-like. And as the last paragraph of the second quote above hints, there are situations that do not allow text extraction at all (without OCR, that is), even though there are additional places one can look for the mapping to Unicode.




Answer 2:


To decode the mystery string in the most general case, what you need to look at is the /Encoding field of the selected font, in your case the font /F1. More than likely, the encoding is /Identity-H, in which case the PDF strings consist of 16-bit character codes, and the mapping from those codes to UTF-16 characters can be arbitrary.

Here is an example from the PDF parser I'm writing. Each page contains a dictionary of resources, which contains a dictionary of fonts:

[&3|0] => Array [
   [/Type] => |/Page|
   [/Resources] => Array [
      [/Font] => Array [
         [/F1] => |&5|0|
         [/F2] => |&7|0|
         [/F3] => |&9|0|
         [/F4] => |&14|0|
         [/F5] => |&16|0|
      ]
   ]
   [/Contents] => |&4|0|
]

In my case, /F3 was producing unusable text, so looking at /F3:

[&9|0] => Array [
    [/Type] => |/Font|
    [/Subtype] => |/Type0|
    [/BaseFont] => |/Arial|
    [/Encoding] => |/Identity-H|
    [/DescendantFonts] => |&10|0|
    [/ToUnicode] => |&96|0|
]

Here you can see that the /Encoding type is /Identity-H. The mapping for decoding the character codes used in /F3 is stored in the stream referenced by /ToUnicode. Here is the relevant text from the stream referenced by '&96|0' (96 0 R); the rest is boilerplate and is omitted:

...
beginbfchar
<0003> <0020>
<000F> <002C>
<0015> <0032>
<001B> <0038>
<002C> <0049>
<003A> <0057>
endbfchar
...
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<004F> <0053> <006C>
<0055> <0059> <0072>
endbfrange
...
beginbfchar
<005C> <0079>
<00B1> <2013>
<00B6> <2019>
endbfchar
...

The 16-bit pairs between beginbfchar/endbfchar are mappings of individual characters. For example <0003> (0x0003) is mapped onto <0020> (0x0020), which is the space character.

The 16-bit triplets between beginbfrange/endbfrange are mappings of character ranges. For example, the characters from <0055> (first) to <0059> (last) are mapped onto <0072>, <0073>, <0074>, <0075> and <0076> ('r' through 'v' in UTF-16 and ASCII).
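As a rough illustration of the two section types (in Python, not the parser quoted above), the following sketch builds a lookup table from bfchar and bfrange entries and decodes a hex string with it. The function names are hypothetical. It assumes fixed two-byte codes, range destinations that are single UTF-16 code units, and only the triplet form of bfrange; the array form (<lo> <hi> [<u1> <u2> ...]) is not handled.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_to_unicode(cmap_text):
    """Build {character code: str} from bfchar and bfrange sections."""
    mapping = {}
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            # The destination is UTF-16BE and may span several code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            start = int(dst, 16)
            # Consecutive codes map to consecutive Unicode values.
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping

def decode_hex_string(hex_string, mapping):
    """Decode the contents of a <...> string operand, 2 bytes per code."""
    raw = bytes.fromhex(hex_string)
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Unmapped codes become U+FFFD instead of being guessed.
    return "".join(mapping.get(code, "\ufffd") for code in codes)

CMAP = """
beginbfchar
<0003> <0020>
<003A> <0057>
endbfchar
beginbfrange
<0044> <0045> <0061>
<0047> <004C> <0064>
<004F> <0053> <006C>
<0055> <0059> <0072>
endbfrange
"""

table = parse_to_unicode(CMAP)
print(decode_hex_string("003A00520055004F0047", table))  # World
```

Incidentally, every entry in this sample table maps code c to c + 29, which matches the offset observed in the question; that is a property of this particular font subset, not of the mechanism.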



Source: https://stackoverflow.com/questions/32000107/php-filter-flatedecode-pdf-stream-returning-offset-characters
