Using Java PDFBox library to write Russian PDF

予麋鹿 2020-12-10 13:51

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text inside the

6 answers
  •  小蘑菇
    小蘑菇 (OP)
    2020-12-10 14:41

    The long story is this - in order to do Unicode output in PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this - inside a TrueType font the glyphs are stored as glyph IDs. These glyph IDs are associated with particular Unicode characters (and IIRC, a Unicode glyph may internally refer to several code points - like é referring to e and an acute accent - my memory is hazy). PDF doesn't really have Unicode support other than to say that there exists a mapping from UTF-16BE values in a string to glyph IDs in a TrueType font, as well as a mapping from UTF-16BE values to Unicode - even if it's the identity mapping.
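
    To make the "UTF-16BE values in a string" part concrete, here is a minimal sketch of turning Russian text into the hex string that would go into a content stream. It assumes the setup this answer describes, where each two-byte code in the string is simply the UTF-16BE code unit; the class and method names are hypothetical and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not a PDFBox class. */
class Utf16HexString {

    /** Encodes text as a PDF hex string of UTF-16BE code units, e.g. <041F0440>. */
    static String toPdfHexString(String text) {
        // UTF_16BE writes two bytes per code unit and no byte order mark.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // The Russian word for "hello", written with escapes so the
        // source file encoding does not matter.
        String hello = "\u041F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(toPdfHexString(hello) + " Tj");
        // prints: <041F04400438043204350442> Tj
    }
}
```

    Characters outside the Basic Multilingual Plane come out as surrogate pairs here, which this simple scheme does not handle.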

    In the PDF itself, that means you need:

    • a Font dictionary of Subtype Type0 with
      • a DescendantFonts array with an entry described below
      • a ToUnicode entry that maps UTF-16BE values to Unicode
      • an Encoding set to Identity-H

    Output from one of my unit tests on my own tools looks like this:

    13 0 obj
    << 
       /BaseFont /DejaVuSansCondensed 
       /DescendantFonts [ 4 0 R  ]   
       /ToUnicode 14 0 R 
       /Type /Font 
       /Subtype /Type0 
       /Encoding /Identity-H 
    >> endobj
    
    14 0 obj
    << /Length 346 >> stream
    /CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
    /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
    def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
    beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
    defineresource pop end end
    

    endstream % note that the formatting is wrong for the stream
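
    The ToUnicode stream in object 14 is plain text, so it can be generated. Here is a minimal sketch, assuming an identity mapping over the two-byte range <0000> to <FFFF> (the range ends are an assumption) and a hypothetical class name. It writes one operator per line rather than copying the layout of the dump above, and computes /Length from the actual bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not a PDFBox class. */
class IdentityToUnicodeCMap {

    /** Returns the CMap program text, without a trailing newline. */
    static String build() {
        return String.join("\n",
            "/CIDInit /ProcSet findresource begin",
            "12 dict begin",
            "begincmap",
            "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
            "/CMapName /Adobe-Identity-UCS def",
            "/CMapType 2 def",
            "1 begincodespacerange",
            "<0000> <FFFF>",          // all two-byte codes are valid (assumed range)
            "endcodespacerange",
            "1 beginbfrange",
            "<0000> <FFFF> <0000>",   // code N maps to Unicode value N (identity)
            "endbfrange",
            "endcmap",
            "CMapName currentdict /CMap defineresource pop",
            "end",
            "end");
    }

    /** Wraps the CMap in an uncompressed stream object with a matching /Length. */
    static String asStreamObject(int objectNumber) {
        String body = build();
        int length = body.getBytes(StandardCharsets.US_ASCII).length;
        // The newline before "endstream" is not counted in /Length.
        return objectNumber + " 0 obj\n<< /Length " + length + " >>\nstream\n"
                + body + "\nendstream\nendobj\n";
    }

    public static void main(String[] args) {
        System.out.print(asStreamObject(14));
    }
}
```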

    • a Font dictionary of Subtype CIDFontType2 with
      • a CIDSystemInfo
      • a FontDescriptor
      • DW and W
      • a CIDToGIDMap that maps from character ID to glyph ID

    Here's the one from the same test - this is the object in the DescendantFonts array:

    4 0 obj
    << 
       /Subtype /CIDFontType2 
       /Type /Font 
       /BaseFont /DejaVuSansCondensed 
       /CIDSystemInfo 8 0 R 
       /FontDescriptor 9 0 R 
       /DW 1000 
       /W 10 0 R 
       /CIDToGIDMap 11 0 R 
    >>
    
    8 0 obj
    << 
       /Registry (Adobe)
       /Ordering (UCS)
       /Supplement 0 
    >>
    endobj
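
    The CIDToGIDMap stream (object 11 above, not shown) is a flat binary table: the glyph ID for character ID c is stored as two bytes, high byte first, at offsets 2c and 2c+1. Here is a minimal sketch of building that table. The class name is hypothetical and the glyph IDs in main are made up; real ones would come from the font's own cmap table.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not a PDFBox class. */
class CidToGidMapBuilder {

    /** Builds the raw (uncompressed) CIDToGIDMap stream bytes. */
    static byte[] build(Map<Integer, Integer> cidToGid) {
        int maxCid = -1;
        for (int cid : cidToGid.keySet()) {
            if (cid < 0 || cid > 0xFFFF) {
                throw new IllegalArgumentException("CID out of range: " + cid);
            }
            maxCid = Math.max(maxCid, cid);
        }
        // Entries that are never set stay 0, which is the "missing glyph" ID.
        byte[] table = new byte[2 * (maxCid + 1)];
        for (Map.Entry<Integer, Integer> e : cidToGid.entrySet()) {
            int cid = e.getKey();
            int gid = e.getValue();
            if (gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("Glyph ID out of range: " + gid);
            }
            table[2 * cid] = (byte) (gid >> 8);   // high-order byte first
            table[2 * cid + 1] = (byte) gid;      // low-order byte
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new TreeMap<>();
        map.put(0x041F, 550); // made-up glyph ID for U+041F
        map.put(0x0440, 583); // made-up glyph ID for U+0440
        System.out.println("table size in bytes: " + build(map).length);
    }
}
```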
    

    Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode and it was painful from the start to have CJK encodings without Unicode (I know - I worked on Acrobat then). Later Unicode support was added, but it really felt like it was glommed on. One would hope that you would just say /Encoding /Unicode and have strings that start with the thorn and y-dieresis characters and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.

    At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode - if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you - it is a serious hassle.
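
    That "WinAnsi if it fits, otherwise the full Type0 setup" decision can be sketched in a few lines. This is only an illustration, not how any particular tool does it: the class and method names are hypothetical, and Java's windows-1252 charset is used as a stand-in for WinAnsiEncoding, which is based on that code page but is not guaranteed to match it character for character.

```java
import java.nio.charset.Charset;

/** Hypothetical helper for illustration only; not a PDFBox class. */
class EncodingChooser {

    // Assumption: windows-1252 approximates PDF's WinAnsiEncoding closely enough
    // for the purpose of this check.
    private static final Charset WIN_1252 = Charset.forName("windows-1252");

    /** True if every character of the text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        // A fresh encoder per call keeps this method thread-safe.
        return WIN_1252.newEncoder().canEncode(text);
    }

    /** Picks the simple path when possible, the composite-font path otherwise. */
    static String chooseFontSetup(String text) {
        return fitsWinAnsi(text)
                ? "simple font with /WinAnsiEncoding"
                : "Type0 font with /Identity-H, ToUnicode and CIDToGIDMap";
    }

    public static void main(String[] args) {
        System.out.println(chooseFontSetup("Hello"));
        System.out.println(chooseFontSetup("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```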
