How to normalize a fancy-looking Unicode string in C#?

Submitted by ☆樱花仙子☆ on 2020-08-22 02:55:24

Question


I receive text styled like this from a REST API, for example:

  • 𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰?

  • 𝐻𝑜𝓌 𝓉𝑜 𝓇𝑒𝓂𝑜𝓋𝑒 𝓉𝒽𝒾𝓈 𝒻𝑜𝓃𝓉 𝒻𝓇𝑜𝓂 𝒶 𝓈𝓉𝓇𝒾𝓃𝑔?

  • нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?

But this is not italic, bold, or underlined formatting, since the type is a plain string. This kind of text makes my Regex ^[a-zA-Z0-9._]*$ fail.

I would like to normalize the received string into a standard one so that my Regex still works.


Answer 1:


You can use Unicode Compatibility normalization forms, which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) to their simplified equivalents.

In Python, for instance:

>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'

# EDIT: This one wouldn't work
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'
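To connect this back to the Regex in the question, here is a minimal sketch that normalizes with NFKD and then applies the same pattern. The helper names `to_bold_script` and `normalize_and_check` are made up for illustration, and the test input is generated from the Mathematical Bold Script block rather than taken from a real API response.

```python
import re
from unicodedata import normalize

# The pattern from the question: ASCII letters, digits, dot and underscore only.
ALLOWED = re.compile(r"^[a-zA-Z0-9._]*$")


def to_bold_script(text):
    """Hypothetical helper: build 'fancy' test input by shifting ASCII letters
    into the Mathematical Bold Script block (U+1D4D0 is 'A', U+1D4EA is 'a')."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D4D0 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D4EA + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits and punctuation stay as they are
    return "".join(out)


def normalize_and_check(text):
    """Return the NFKD-normalized text and whether it matches ALLOWED."""
    plain = normalize("NFKD", text)
    return plain, ALLOWED.match(plain) is not None


fancy = to_bold_script("John_Doe.42")
print(ALLOWED.match(fancy) is not None)  # the styled input fails the pattern
print(normalize_and_check(fancy))        # the normalized input passes
print(normalize_and_check("ѕтяιηg"))     # look-alike letters still fail
```

Normalizing before validating keeps the Regex itself unchanged, which seems to be what the question asks for. Note that the sample strings in the question also contain spaces and a question mark, which this pattern rejects regardless of normalization.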


EDIT: Note that this only applies to stylistic forms (superscripts, blackletter, full-width, etc.), so your third example, which uses non-Latin characters (Cyrillic and Greek letters that merely look like Latin ones), can't be decomposed to ASCII.
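One way to see the difference is to ask Python which characters carry a compatibility mapping at all. The sketch below uses `unicodedata.decomposition` for that, and then falls back to a small hand-written look-alike table for the third example. The names `compat_tag`, `LOOKALIKES`, and `simplify` are hypothetical, and the table is an assumption covering only the letters in that one sample; it is not a mapping defined by Unicode normalization.

```python
from unicodedata import decomposition, normalize


def compat_tag(ch):
    """Return the compatibility tag of ch (such as '<font>'), or '' if the
    character has no compatibility decomposition."""
    mapping = decomposition(ch)
    return mapping.split()[0] if mapping.startswith("<") else ""


# Hypothetical, hand-written table: only the letters from the third example.
LOOKALIKES = str.maketrans({
    "н": "h", "σ": "o", "ω": "w", "т": "t", "я": "r", "є": "e",
    "м": "m", "ν": "v", "ι": "i", "ѕ": "s", "ƒ": "f", "η": "n", "α": "a",
})


def simplify(text):
    """Apply NFKD first, then replace the known look-alike letters."""
    return normalize("NFKD", text).translate(LOOKALIKES)


print(compat_tag("\U0001D4D7"))  # MATHEMATICAL BOLD SCRIPT CAPITAL H
print(compat_tag("\u00B2"))      # SUPERSCRIPT TWO
print(compat_tag("\uFF21"))      # FULLWIDTH LATIN CAPITAL LETTER A
print(repr(compat_tag("н")))     # Cyrillic letter: no compatibility mapping
print(simplify("нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?"))
```

A table like this is lossy and will also rewrite legitimate Cyrillic or Greek text, so it probably only makes sense when the input is expected to be Latin in the first place.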

EDIT2: I didn't realize your question was specific to C#. See the documentation for String.Normalize, which does just that:

using System.Text; // NormalizationForm is defined in this namespace
string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD); // compatibility decomposition (NFKD)


Source: https://stackoverflow.com/questions/61959960/how-to-normalize-fancy-looking-unicode-string-in-c
