Python isalpha doesn't handle Unicode combining marks properly?

Submitted by 烈酒焚心 on 2019-12-22 17:52:53

Question


I came across an odd Ukrainian word, Кири́лл. I converted it to unicode and tested it with isalpha, which returned False. I looked around and found that the word contains a character named 'combining acute accent', so the letter и́ is actually a sequence of two characters: и and ́. If I understand correctly, combining marks (like this acute accent) exist only to modify other characters, so isalpha should recognize this string as a word. Am I wrong? Is there any way to get the correct result? The word in question, encoded as UTF-8:

word = '\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb'


Answer 1:


I think you will need to strip out any modifier (combining) characters before testing, since isalpha does not treat a modifier character as alphabetic:

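import re

# Alternation of UTF-8 byte sequences to strip; "\xcc\x81" is U+0301 (combining acute accent).
# <OTHER> and <MODIFIERS> are placeholders for any further combining marks.
modifiers = "\xcc\x81|<OTHER>|<MODIFIERS>"

# Python 2: my_text is a UTF-8 encoded byte string, such as word from the question.
text_to_analyze = re.sub(modifiers, "", my_text)
print unicode(text_to_analyze, "utf8").isalpha()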
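
As an alternative to listing modifier byte sequences by hand, the standard unicodedata module can identify combining marks by their Unicode category. The following is only a sketch, written for Python 3 (the question and the snippet above use Python 2); the helper names is_alpha_with_marks and strip_marks are made up for illustration and are not part of any library.

```python
import unicodedata


def is_alpha_with_marks(text):
    """True if text has at least one letter and otherwise only combining marks.

    Hypothetical helper: marks are detected via the general categories
    Mn, Mc and Me, which all start with "M".
    """
    has_letter = False
    for ch in text:
        if ch.isalpha():
            has_letter = True
        elif not unicodedata.category(ch).startswith("M"):
            return False
    return has_letter


def strip_marks(text):
    """Decompose to NFD, then drop characters with a non-zero combining class.

    Hypothetical helper. Note that NFD also splits precomposed letters,
    so a letter such as й would lose its breve and become и.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The word from the question, decoded from its UTF-8 bytes.
word = b"\xd0\x9a\xd0\xb8\xd1\x80\xd0\xb8\xcc\x81\xd0\xbb\xd0\xbb".decode("utf-8")

print(word.isalpha())               # False: U+0301 has category Mn, not a letter
print(is_alpha_with_marks(word))    # True
print(strip_marks(word).isalpha())  # True
```

The first helper leaves the text untouched and only changes the test, which is probably what you want when the accents should be preserved; the second is closer in spirit to the regex substitution above.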


Source: https://stackoverflow.com/questions/21920882/python-isalpha-doesnt-handle-unicode-combing-marks-properly
