Extract ToUnicode map from one PDF and use in another

Submitted by 半城伤御伤魂 on 2020-01-23 13:34:07

Question


I have a Unicode PDF document that is missing the ToUnicode map. I have a different PDF that uses the same font and does have the ToUnicode map. Can I extract the map from one PDF and use it to extract text from the other?


Answer 1:


In general, the answer is no. The ToUnicode map you are talking about follows the PDF CMap format and is used to translate character codes into Unicode values. You face two potential pitfalls:

1) The fonts are not exactly the same. While their names may be the same, they might have different encodings, or might contain different glyphs (even for the same encoding). In that case, applying the CMap from a different font would give you incorrect Unicode values.

2) The fonts may be the same in all aspects but may be subsetted in the PDF file (likely) and the subset may be different. There are certainly cases where that wouldn't change the way the font is stored in the PDF file, but there are optimising PDF writers that will condense anything they can in subsetted fonts, which may give rise to different character codes being used and ultimately different ToUnicode maps.
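
Both pitfalls can be checked for before trusting a borrowed map. The following is only a minimal sketch: it assumes you have already obtained the donor map as a dict from character code to Unicode string, the set of character codes the target PDF actually uses, and a few code/character pairs you have verified by eye against the rendered page. The function name and all sample values are hypothetical.

```python
def check_donor_map(donor, codes_used, verified):
    """Sanity-check a ToUnicode map borrowed from another PDF.

    donor      -- dict: character code -> Unicode string (from the donor PDF)
    codes_used -- set of character codes that occur in the target PDF
    verified   -- dict: character code -> Unicode string confirmed by hand
    Returns (missing, conflicts).
    """
    # Codes the target uses that the donor map cannot translate at all.
    missing = sorted(code for code in codes_used if code not in donor)
    # Codes where the donor map disagrees with what was verified by hand.
    conflicts = {
        code: (donor[code], expected)
        for code, expected in verified.items()
        if code in donor and donor[code] != expected
    }
    return missing, conflicts


# Hypothetical sample data, for illustration only.
donor_map = {0x0003: " ", 0x0024: "A", 0x0025: "B"}
target_codes = {0x0003, 0x0024, 0x0026}
hand_checked = {0x0024: "A", 0x0025: "C"}
missing, conflicts = check_donor_map(donor_map, target_codes, hand_checked)
```

If either result is non-empty, the two fonts were most likely subsetted or encoded differently, and the donor map should not be used as-is.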




Answer 2:


For Unicode mapping, PDF has a special resource, /ToUnicode. You can find it in the PDF file inside the font resource description, which looks like this:

<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>

and /ToUnicode 11 0 R is what you need to have in the PDF file. 11 0 R is an indirect reference to object number 11, generation 0.

I created a sample PDF in Acrobat Pro containing all the alphabet symbols, using the same font that is used in the report, in order to get a standard ToUnicode mapping. I then extracted the resource as text; it looks something like this:
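
As a rough illustration, the reference can be pulled out of a font dictionary like the one above with a regular expression. This is a sketch that assumes the dictionary is available as raw, uncompressed bytes; a real PDF library resolves such references properly. The function name is hypothetical.

```python
import re

# Matches "/ToUnicode <object number> <generation> R".
TOUNICODE_REF = re.compile(rb"/ToUnicode\s+(\d+)\s+(\d+)\s+R")


def find_tounicode_ref(font_dict):
    """Return (object_number, generation) or None if there is no /ToUnicode."""
    match = TOUNICODE_REF.search(font_dict)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


font = (b"<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]"
        b"/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>")
ref = find_tounicode_ref(font)
```

A result of None for a font dictionary means that font has no ToUnicode map, which is the situation described in the question.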

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
50 beginbfchar
<0003> <0020>
...and so on...
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
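
To make the listing above concrete, here is a small sketch that reads the bfchar entries of such a CMap into a Python dict. It assumes the CMap is already decompressed text, that hex strings contain no embedded whitespace, and it ignores bfrange sections entirely; the function name is hypothetical.

```python
import re

# Everything between "beginbfchar" and "endbfchar".
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
# One "<source code> <destination>" pair of hex strings.
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return a dict mapping character codes to Unicode strings."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in PAIR.findall(block):
            # Destination values in a ToUnicode CMap are UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


sample = """2 beginbfchar
<0003> <0020>
<0010> <00660069>
endbfchar"""
table = parse_bfchar(sample)
```

Here code 0x0003 maps to a space, as in the listing above, and the second (made-up) entry shows that one code may map to more than one character, such as a ligature.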

The ToUnicode resource is usually compressed, so you have to decompress it to get text like the above.
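
A minimal sketch of that decompression step, assuming the stream uses the common /FlateDecode filter (zlib/deflate) with no predictor, and that you already have the raw bytes of the whole object. The function name is hypothetical, and a robust reader would use the /Length entry rather than searching for keywords.

```python
import zlib


def inflate_stream(obj_bytes):
    """Decompress the FlateDecode stream inside one raw PDF object."""
    start = obj_bytes.index(b"stream") + len(b"stream")
    # The "stream" keyword is followed by CRLF or LF before the data.
    if obj_bytes[start:start + 2] == b"\r\n":
        start += 2
    elif obj_bytes[start:start + 1] == b"\n":
        start += 1
    end = obj_bytes.rindex(b"endstream")
    # decompressobj stops at the end of the zlib data, so the end-of-line
    # marker before "endstream" is ignored instead of raising an error.
    return zlib.decompressobj().decompress(obj_bytes[start:end])


# Build a tiny fake object to round-trip (illustration only).
text = b"/CIDInit /ProcSet findresource begin\n"
packed = zlib.compress(text)
fake_obj = (b"11 0 obj\n<</Filter /FlateDecode /Length "
            + str(len(packed)).encode() + b">>\nstream\n"
            + packed + b"\nendstream\nendobj\n")
restored = inflate_stream(fake_obj)
```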

Then I wrote code that takes a PDF (a report generated by Microsoft Reporting) and adds a /ToUnicode resource for each font found. A PDF has an xref table with byte offsets pointing to its objects, so you can't edit it as a text file. You therefore have to use a PDF engine (I used PDFTron, but iText should be enough). This post-processing code is executed each time I need to save a report as PDF. Ideally the ToUnicode mapping would be filled in by the Microsoft Reporting engine itself, but that would be too good to be true.
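
The PDFTron code itself is not shown here. As a partial illustration only, the sketch below generates the CMap text that such post-processing would attach to a font, using the same header as the listing earlier in this answer. It assumes two-byte character codes and writes bfchar entries in blocks of at most 100, the per-block limit in Adobe's CMap specification as far as I know. The function name is hypothetical, and embedding the result as a stream and pointing /ToUnicode at it still requires a PDF library.

```python
def build_tounicode_cmap(mapping):
    """Build ToUnicode CMap text from a dict: character code -> Unicode string."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo",
        "<< /Registry (Adobe)",
        "/Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    for i in range(0, len(items), 100):
        chunk = items[i:i + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are written as UTF-16BE hex strings.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dst}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop",
              "end", "end"]
    return "\n".join(lines) + "\n"


cmap_text = build_tounicode_cmap({0x0003: " ", 0x0024: "A"})
```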

That's it.



Source: https://stackoverflow.com/questions/13668105/extract-tounicode-map-from-one-pdf-and-use-in-another
