Convert hexadecimal character (ligature) to utf-8 character

让人想犯罪, posted on 2021-02-07 13:54:17

Question


I have some text content that was converted from a PDF file. The text contains some unwanted characters, and I want to convert them to ordinary UTF-8 characters.

For instance, 'Artificial Immune System' is converted to 'Artiﬁcial Immune System'. The 'ﬁ' comes out as a single character. I used gdex to look up the ASCII value of the character, but I don't know how to replace it with the real value throughout the content.


Answer 1:


I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).

In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:

>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
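NFKD does more than split ligatures: it applies every compatibility decomposition and also separates accents from their base letters. The following is a minimal sketch of that behaviour, assuming Python 3 (where every str is Unicode); the helper name nfkd and the sample strings are made up for illustration.

```python
import unicodedata

def nfkd(text):
    """Return text in NFKD normal form (hypothetical helper for this sketch)."""
    return unicodedata.normalize("NFKD", text)

# Illustrative inputs: one ligature plus characters that NFKD also rewrites.
samples = {
    "ligature": "Arti\uFB01cial",   # contains U+FB01 LATIN SMALL LIGATURE FI
    "superscript": "x\u00b2",       # U+00B2 SUPERSCRIPT TWO
    "trademark": "\u2122",          # U+2122 TRADE MARK SIGN
    "accented": "caf\u00e9",        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
}

for label, text in samples.items():
    # ascii() makes code points visible that would otherwise look identical.
    print(label, ascii(text), "->", ascii(nfkd(text)))
```

The superscript and trademark sign turn into plain letters and digits, and the accented letter becomes a base letter followed by a combining accent. If the accent splitting is unwanted, the "NFKC" form recomposes such letters after decomposing, though it still rewrites the other compatibility characters.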

So normalizing your strings with NFKD should help you along. If you find that this splits too much, then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:

>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
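The table above can also be generated rather than typed by hand. This is a sketch under a few assumptions: Python 3, and that only the Latin ligatures at U+FB00 to U+FB06 matter for your text. The names LATIN_LIGATURES, ligature_table and split_ligatures are invented for this example.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00 to U+FB06).
# Assumption: these are the only ligatures that need splitting.
LATIN_LIGATURES = range(0xFB00, 0xFB07)

# Derive each replacement from the character's own compatibility decomposition,
# so the expansions do not have to be written out manually.
ligature_table = {
    cp: unicodedata.normalize("NFKD", chr(cp)) for cp in LATIN_LIGATURES
}

def split_ligatures(text):
    """Replace only the ligatures in the table; leave other characters alone."""
    return text.translate(ligature_table)

print(split_ligatures("Arti\uFB01cial Immune System"))
print(ascii(split_ligatures("caf\u00e9 e\uFB03cient\u00b2")))
```

Because translate only touches the code points listed in the table, accented letters, superscripts and other compatibility characters pass through unchanged, which is the point of preferring a mapping table over normalizing the whole string.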

Refer to the Wikipedia article to get a list of ligatures in Unicode.



Source: https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character
