I am using a Java library called PDFBox to write text to a PDF. It works perfectly for English text, but when I tried to write Russian text inside the
The long story is this: to produce Unicode output in a PDF from a TrueType font, the output must include a ton of detailed and seemingly superfluous information. What it comes down to is this: inside a TrueType font, the glyphs are stored under glyph IDs. These glyph IDs are associated with particular Unicode characters (and IIRC, a single glyph may internally refer to several code points, like é referring to e and an acute accent, but my memory is hazy). PDF doesn't really have Unicode support, other than to say that there exists a mapping from UTF-16BE values in a string to glyph IDs in a TrueType font, as well as a mapping from UTF-16BE values to Unicode, even if it's the identity.
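To make the "UTF-16BE values in a string" part concrete, here is a minimal Java sketch of how a piece of text could be turned into the hex string operand of a PDF text-showing operator. It assumes the setup described above, where the 2-byte codes in the string are UTF-16 code units and the font's own tables take care of mapping them to glyph IDs. The class and method names are made up for illustration and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

final class PdfHexStringSketch {
    /** Encodes text as a PDF hex string of UTF-16BE code units, e.g. "<041F0440>". */
    static String toHexString(String text) {
        // UTF_16BE writes two bytes per code unit, high byte first, with no byte order mark.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // "Привет" written with escapes so the source file encoding does not matter.
        String hello = "\u041F\u0440\u0438\u0432\u0435\u0442";
        // Prints: <041F04400438043204350442> Tj
        System.out.println(toHexString(hello) + " Tj");
    }
}
```

Whether a viewer shows the right glyphs for these codes depends entirely on the font dictionaries that go with them.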
Output from one of my unit tests on my own tools looks like this:
13 0 obj
<<
/BaseFont /DejaVuSansCondensed
/DescendantFonts [ 4 0 R ]
/ToUnicode 14 0 R
/Type /Font
/Subtype /Type0
/Encoding /Identity-H
>> endobj
14 0 obj
<< /Length 346 >> stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def /CMapName /Adobe-Identity-UCS
def /CMapType 2 def 1 begincodespacerange <0000> endcodespacerange 1
beginbfrange <0000> <0000> endbfrange endcmap CMapName currentdict /CMap
defineresource pop end end
endstream % note that the formatting is wrong for the stream
endobj
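For illustration, the body of a ToUnicode stream like the one above could be generated along these lines. This is a rough sketch, not what my tools or PDFBox actually do: it assumes every code in the content stream is a 2-byte UTF-16 code unit from the Basic Multilingual Plane, so each code simply maps to itself, and it emits `bfchar` entries only for the characters actually used. The CMap format limits each block to 100 entries, hence the chunking. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ToUnicodeCMapSketch {
    /** Builds the text of a ToUnicode CMap that maps each used 2-byte code to the same UTF-16 value. */
    static String build(String text) {
        // Collect the distinct UTF-16 code units in ascending order.
        // Assumption: BMP text only; surrogate pairs would need different handling.
        TreeSet<Integer> unique = new TreeSet<>();
        for (int i = 0; i < text.length(); i++) {
            unique.add((int) text.charAt(i));
        }
        List<Integer> codes = new ArrayList<>(unique);

        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n");
        sb.append("12 dict begin\n");
        sb.append("begincmap\n");
        sb.append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n");
        sb.append("/CMapName /Adobe-Identity-UCS def\n");
        sb.append("/CMapType 2 def\n");
        sb.append("1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n");

        // Emit the mappings in blocks of at most 100 entries each.
        for (int start = 0; start < codes.size(); start += 100) {
            int end = Math.min(start + 100, codes.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (int code : codes.subList(start, end)) {
                String hex = String.format("%04X", code);
                sb.append('<').append(hex).append("> <").append(hex).append(">\n");
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n");
        sb.append("CMapName currentdict /CMap defineresource pop\n");
        sb.append("end\nend\n");
        return sb.toString();
    }
}
```

The returned text still has to be wrapped in a stream object with a correct /Length, which this sketch leaves out.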
Here's another one from the same test. This is the object in the DescendantFonts array:
4 0 obj
<<
/Subtype /CIDFontType2
/Type /Font
/BaseFont /DejaVuSansCondensed
/CIDSystemInfo 8 0 R
/FontDescriptor 9 0 R
/DW 1000
/W 10 0 R
/CIDToGIDMap 11 0 R
>>
endobj
8 0 obj
<<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>>
endobj
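The /CIDToGIDMap entry in the descendant font is where the "code to glyph ID" mapping lives. When it is a stream, its data is a flat table with two bytes per CID, high-order byte first, holding the glyph ID for that CID. A minimal Java sketch of building that table is below. It assumes you have already pulled a code-to-glyph map out of the TrueType font's cmap table by some other means; the map parameter and the class name are hypothetical.

```java
import java.util.Map;

final class CidToGidMapSketch {
    /** Builds CIDToGIDMap stream data: bytes 2*cid and 2*cid+1 hold the glyph ID, big-endian. */
    static byte[] build(Map<Integer, Integer> codeToGlyph) {
        int maxCode = -1;
        for (int code : codeToGlyph.keySet()) {
            maxCode = Math.max(maxCode, code);
        }
        // Codes that are never mentioned stay 0, which is the font's "missing glyph".
        byte[] data = new byte[(maxCode + 1) * 2];
        for (Map.Entry<Integer, Integer> entry : codeToGlyph.entrySet()) {
            int cid = entry.getKey();
            int gid = entry.getValue();
            if (cid < 0 || cid > 0xFFFF || gid < 0 || gid > 0xFFFF) {
                throw new IllegalArgumentException("CID and glyph ID must fit in 16 bits");
            }
            data[2 * cid] = (byte) (gid >> 8);   // high-order byte
            data[2 * cid + 1] = (byte) gid;      // low-order byte
        }
        return data;
    }
}
```

For a font covering Cyrillic, the table only needs to extend up to the highest code actually mapped, so it stays a few kilobytes at most.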
Why am I telling you this? What does it have to do with PDFBox? Just this: Unicode output in PDF is, frankly, a royal pain in the butt. Acrobat was developed before there was Unicode, and it was painful from the start to have CJK encodings without Unicode (I know, I worked on Acrobat then). Later, Unicode support was added, but it really felt like it was glommed on. One would hope that you could just say /Encoding /Unicode, have strings that start with the thorn and y-dieresis characters (þÿ, which is how the UTF-16 byte order mark FE FF looks when read as Latin-1), and off you go. No such luck. If you don't put in every detailed thing (and really, Acrobat, embedding a PostScript program to translate to Unicode? WTH?), you get a blank page in Acrobat. I swear, I am not making this up.
At this point, I write PDF generation tools for a separate company (.NET right now, so it won't help you), and I made it a design goal to hide all that nonsense. All text is Unicode: if you only use those character codes that are the same as WinAnsi, that's what you get under the hood. Use anything else, and you get all this other stuff with it. I'd be surprised if PDFBox does that work for you, because it is a serious hassle.
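The "WinAnsi if it fits, the full machinery otherwise" decision can be sketched in Java roughly like this. It is only an approximation of the idea, not my actual code: it treats Java's windows-1252 charset as a stand-in for PDF's WinAnsiEncoding, which is close but not guaranteed to match in every code position, and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;

final class EncodingChoiceSketch {
    /** True if every character of the text can be encoded in windows-1252 (used here as a proxy for WinAnsi). */
    static boolean fitsWinAnsi(String text) {
        return Charset.forName("windows-1252").newEncoder().canEncode(text);
    }

    /** Picks the simple single-byte encoding when possible, else the 2-byte composite font route. */
    static String chooseEncoding(String text) {
        return fitsWinAnsi(text) ? "/WinAnsiEncoding" : "/Identity-H";
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("Hello"));                    // prints /WinAnsiEncoding
        System.out.println(chooseEncoding("\u041F\u0440\u0438\u0432")); // Cyrillic, prints /Identity-H
    }
}
```

Russian text always lands in the second branch, which is why it needs the Type0 font, CIDToGIDMap, and ToUnicode objects shown earlier.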