Using Java PDFBox library to write Russian PDF

予麋鹿 2020-12-10 13:51

I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text inside the

6 answers

小蘑菇 (original poster)

2020-12-10 14:41
The long story is this: in order to do Unicode output in PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this: inside a TrueType font the glyphs are stored as glyph ids. These glyph ids are associated with particular Unicode characters (and IIRC, a single glyph may internally refer to several code points, like é referring to e and an acute accent; my memory is hazy). PDF doesn't really have Unicode support other than to say that there exists a mapping from UTF-16BE values in a string to glyph ids in a TrueType font, as well as a mapping from UTF-16BE values to Unicode, even if it's the identity. In the PDF that means you need:
- a Font dictionary of Subtype Type0 with
  - a DescendantFonts array with an entry described below
  - a ToUnicode entry that maps UTF16BE values to unicode
  - an Encoding set to Identity-H
Output from one of my unit tests on my own tools looks like this:
```
13 0 obj
<< 
   /BaseFont /DejaVuSansCondensed 
   /DescendantFonts [ 4 0 R  ]   
   /ToUnicode 14 0 R 
   /Type /Font 
   /Subtype /Type0 
   /Encoding /Identity-H 
>> endobj

14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> <FFFF> endcodespacerange 1
beginbfrange <0000> <FFFF> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end
endstream
endobj % note that the formatting is wrong for the stream
```
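To make the ToUnicode object less mysterious, here is a minimal Java sketch that emits an identity CMap over the full two-byte range and computes a matching /Length. The class and method names are made up for illustration; this is not PDFBox code, and it only mirrors the single-range shape of the listing above. Stricter PDF readers may want the `bfrange` split so that the low byte never overflows within one range, which this sketch does not attempt.

```java
import java.nio.charset.StandardCharsets;

/** Sketch only: builds an identity ToUnicode CMap stream object as text. */
class IdentityToUnicode {

    // CMap program: every two-byte code <0000>..<FFFF> maps to the same UTF-16BE value.
    static String cmapBody() {
        return String.join("\n",
            "/CIDInit /ProcSet findresource begin",
            "12 dict begin",
            "begincmap",
            "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
            "/CMapName /Adobe-Identity-UCS def",
            "/CMapType 2 def",
            "1 begincodespacerange",
            "<0000> <FFFF>",
            "endcodespacerange",
            "1 beginbfrange",
            "<0000> <FFFF> <0000>",
            "endbfrange",
            "endcmap",
            "CMapName currentdict /CMap defineresource pop",
            "end",
            "end");
    }

    // Wraps the body in "N 0 obj ... endobj"; the object number is the caller's choice.
    static String streamObject(int objectNumber) {
        String body = cmapBody();
        // /Length counts the stream data only, not the line break before "endstream".
        int length = body.getBytes(StandardCharsets.US_ASCII).length;
        return objectNumber + " 0 obj\n"
            + "<< /Length " + length + " >>\n"
            + "stream\n"
            + body + "\n"
            + "endstream\n"
            + "endobj\n";
    }

    public static void main(String[] args) {
        System.out.print(streamObject(14));
    }
}
```

Computing the length from the encoded bytes rather than typing it by hand avoids the kind of mismatch the comment in the listing warns about.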
- a Font dictionary of Subtype CIDFontType2 with
  - a CIDSystemInfo
  - a FontDescriptor
  - DW and W
  - a CIDToGIDMap that maps from character ID to glyph ID
Here's the one from the same test - this is the object in the DescendantFonts array:
```
4 0 obj
<< 
   /Subtype /CIDFontType2 
   /Type /Font 
   /BaseFont /DejaVuSansCondensed 
   /CIDSystemInfo 8 0 R 
   /FontDescriptor 9 0 R 
   /DW 1000 
   /W 10 0 R 
   /CIDToGIDMap 11 0 R 
>>
endobj

8 0 obj
<< 
   /Registry (Adobe)
   /Ordering (UCS)
   /Supplement 0 
>>
endobj
```
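As a rough illustration of how the string codes and the CIDToGIDMap fit together, the Java sketch below writes text as UTF-16BE code units in a PDF hex string and builds the binary CIDToGIDMap table, in which the glyph id for CID c sits in bytes 2c and 2c+1, high byte first. All names are hypothetical and the glyph id is invented; in a real tool the ids would come from the font's `cmap` table. It assumes text in the Basic Multilingual Plane, which covers Russian.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: the two pieces that tie string codes to glyph ids. */
class CidToGidSketch {

    // Encodes text as a PDF hex string of UTF-16BE code units (no byte order mark).
    static String toPdfHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    // Builds CIDToGIDMap stream data: two bytes per CID, big-endian glyph id.
    static byte[] cidToGidMap(Map<Integer, Integer> glyphIdByCode) {
        int maxCode = 0;
        for (Map.Entry<Integer, Integer> e : glyphIdByCode.entrySet()) {
            int code = e.getKey();
            int gid = e.getValue();
            if (code < 0 || code > 0xFFFF || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("code and glyph id must fit in 16 bits");
            }
            maxCode = Math.max(maxCode, code);
        }
        byte[] table = new byte[2 * (maxCode + 1)]; // unmapped CIDs stay 0, i.e. .notdef
        for (Map.Entry<Integer, Integer> e : glyphIdByCode.entrySet()) {
            int code = e.getKey();
            int gid = e.getValue();
            table[2 * code] = (byte) (gid >> 8);
            table[2 * code + 1] = (byte) gid;
        }
        return table;
    }

    public static void main(String[] args) {
        String hello = "\u041F\u0440\u0438\u0432\u0435\u0442"; // a Russian greeting
        System.out.println(toPdfHexString(hello) + " Tj");

        Map<Integer, Integer> glyphs = new TreeMap<>();
        glyphs.put(0x041F, 570); // invented glyph id, for illustration only
        System.out.println("CIDToGIDMap bytes: " + cidToGidMap(glyphs).length);
    }
}
```

For "Привет" the hex string holds six two-byte codes, and each one is looked up in the table to find the glyph to draw.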
Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode, and it was painful from the start to have CJK encodings without Unicode (I know, I worked on Acrobat then). Later Unicode support was added, but it really felt like it was glommed on. One would hope that you could just say /Encoding /Unicode, have strings that start with the thorn and y-dieresis characters (the bytes FE FF, which is the UTF-16 byte order mark as it looks in Latin-1), and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.
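In case the thorn and y-dieresis remark seems cryptic, this small Java sketch shows where those two characters come from: they are the big-endian byte order mark FE FF read as Latin-1. The class name is made up; the only library behaviour relied on is that Java's `UTF-16` charset writes a big-endian byte order mark when encoding.

```java
import java.nio.charset.StandardCharsets;

/** Sketch only: the UTF-16 byte order mark seen through Latin-1 eyes. */
class BomAsLatin1 {

    // Decodes the two marker bytes as ISO-8859-1, one character per byte.
    static String bomSeenAsLatin1() {
        byte[] bom = { (byte) 0xFE, (byte) 0xFF };
        return new String(bom, StandardCharsets.ISO_8859_1);
    }

    // Java's "UTF-16" encoder emits FE FF first, then big-endian code units.
    static byte[] withBom(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String seen = bomSeenAsLatin1();
        System.out.printf("U+%04X U+%04X%n", (int) seen.charAt(0), (int) seen.charAt(1));
        for (byte b : withBom("A")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```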

At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode: if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you; it is a serious hassle.
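The "WinAnsi if possible, composite font otherwise" decision can be sketched in Java roughly as follows. This is a guess at the general shape, not the code of the .NET tool mentioned above or of PDFBox. It assumes the JDK's `windows-1252` charset is a good enough stand-in for PDF's WinAnsiEncoding (the two are close but not identical), and the class and method names are invented.

```java
import java.nio.charset.Charset;

/** Sketch only: pick the simple or the composite font path per string. */
class FontPathChooser {

    // Assumption: windows-1252 approximates PDF's WinAnsiEncoding.
    private static final Charset WIN_ANSI_LIKE = Charset.forName("windows-1252");

    // True when every character of the text has a single-byte WinAnsi-like code.
    static boolean fitsWinAnsi(String text) {
        return WIN_ANSI_LIKE.newEncoder().canEncode(text);
    }

    // Describes which font setup a generator could emit for this text.
    static String chooseFontKind(String text) {
        return fitsWinAnsi(text)
            ? "simple font, /WinAnsiEncoding"
            : "Type0 font, /Identity-H";
    }

    public static void main(String[] args) {
        System.out.println(chooseFontKind("Hello"));
        System.out.println(chooseFontKind("\u041F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

A caller never sees the difference: English text takes the cheap path, and Russian text pulls in the Type0, CIDFontType2, and ToUnicode objects.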