Extract ToUnicode map from one PDF and use it in another

Question

I have a Unicode PDF document that is missing the ToUnicode map. I have a different PDF with the same font that does have the ToUnicode map. Can I extract the map from one PDF and use it to extract text from the other?

Answer 1:

The generic answer is no. The ToUnicode map you are talking about follows the PDF CMap format and is used to translate character codes into Unicode values. You face two potential pitfalls:

1) The fonts are not exactly the same. While their names may be the same, they might have different encodings, or might contain different glyphs (even for the same encoding). In that case, applying the CMap from a different font would give you incorrect Unicode values.

2) The fonts may be the same in all aspects but may be subsetted in the PDF file (likely) and the subset may be different. There are certainly cases where that wouldn't change the way the font is stored in the PDF file, but there are optimising PDF writers that will condense anything they can in subsetted fonts, which may give rise to different character codes being used and ultimately different ToUnicode maps.
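A cheap first check for the second pitfall is the font name itself: a subsetted font carries a tag of six uppercase letters and a plus sign in front of its name (as in ONWALI+Sylfaen). The sketch below is only an illustration; the function names are made up here, and the result is a hint, not proof. Matching names do not guarantee matching character codes, and differing tags do not always mean the codes differ.

```python
import re

# Subset font names look like "ONWALI+Sylfaen": six uppercase letters, "+", name.
SUBSET_TAG = re.compile(r"^/?([A-Z]{6})\+(.+)$")


def split_subset_tag(base_font):
    """Return (tag, name); tag is None when the name carries no subset tag."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return match.group(1), match.group(2)
    return None, base_font.lstrip("/")


def reuse_risk(font_a, font_b):
    """Describe the most obvious risk of sharing a ToUnicode map (hypothetical helper)."""
    tag_a, name_a = split_subset_tag(font_a)
    tag_b, name_b = split_subset_tag(font_b)
    if name_a != name_b:
        return "different fonts"
    if tag_a != tag_b:
        return "different subsets, character codes may differ"
    return "no obvious mismatch, still verify the output"


print(reuse_risk("/ONWALI+Sylfaen", "/ABCDEF+Sylfaen"))
```

Even when this reports no obvious mismatch, the only real test is to decode some text with the borrowed map and read the result.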

Answer 2:

For Unicode mapping, Adobe defines a special resource, /ToUnicode. You can find it in the PDF file inside the font resource dictionary. It looks like this:

<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>

Here /ToUnicode 11 0 R is what you need to have in the PDF file. 11 0 R is an indirect reference to the object (object number 11, generation 0) that holds the map.

I created a sample PDF containing all the alphabet symbols in Acrobat Pro, using the same font as the report, to get a standard ToUnicode mapping. I extracted the resource as text; it looks something like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
50 beginbfchar
<0003> <0020>
...and so on...
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
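To show how such a map turns character codes into text, here is a minimal Python sketch that reads the bfchar entries into a dictionary and decodes a hex string of 2-byte codes. It assumes the map uses only bfchar sections (no bfrange) and that every destination is an even-length UTF-16BE hex string. The function names and the second sample entry (<0024> <0041>) are invented for the illustration.

```python
import re

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Map character code (int) -> Unicode string, reading bfchar sections only."""
    mapping = {}
    for block in BFCHAR_BLOCK.finditer(cmap_text):
        for src, dst in HEX_PAIR.findall(block.group(1)):
            # Destinations are UTF-16BE code units written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_codes, mapping, code_bytes=2):
    """Decode fixed-width codes such as "00240003"; unknown codes become U+FFFD."""
    step = code_bytes * 2
    codes = [int(hex_codes[i:i + step], 16) for i in range(0, len(hex_codes), step)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Tiny made-up map: the first entry is from the listing above, the second is hypothetical.
sample = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
"""

mapping = parse_bfchar(sample)
print(repr(decode_hex_string("00240003", mapping)))
```

The 2-byte code width matches the <0000> <FFFF> codespace range in the listing; a font with a different codespace would need a different code_bytes value.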

The ToUnicode resource is usually compressed, so you have to decompress it to get text like the above.
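For the decompression step, a rough Python sketch follows. It assumes the stream uses the /FlateDecode filter (zlib) with no predictor, which is common but not guaranteed, and that you have already cut out the raw bytes between the stream and endstream keywords. The function name is hypothetical, and the round trip below fakes a compressed stream instead of reading a real PDF.

```python
import zlib


def inflate_stream(raw):
    """Decompress the raw bytes of a /FlateDecode stream and return them as text."""
    # CMap text is plain ASCII, so latin-1 decodes it without errors.
    return zlib.decompress(raw).decode("latin-1")


# Stand-in for stream bytes cut out of a PDF file.
cmap_source = b"1 beginbfchar\n<0003> <0020>\nendbfchar\n"
fake_stream = zlib.compress(cmap_source)

print(inflate_stream(fake_stream))
```

A PDF library will normally do this for you when you ask for the decoded stream contents, so this is mainly useful when inspecting a file by hand.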

Then I wrote code that takes the PDF (a report generated by Microsoft Reporting) and adds a /ToUnicode resource to each font found. A PDF has an xref table of byte offsets, so you can't edit it as a plain text file; you have to use some PDF engine (I used PDFTron, but iText should be enough). This post-processing code runs each time I need to save a report as PDF. Ideally the Microsoft Reporting engine would fill in the ToUnicode mapping itself, but that would be too good to be true.

That's it.

Source: https://stackoverflow.com/questions/13668105/extract-tounicode-map-from-one-pdf-and-use-in-another

Tags

pdf

unicode