How to find accented characters in a string in Python?

佐手、 提交于 2021-02-07 12:56:52

问题


I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.

I've tried my best to accomplish this, but have had no luck in getting it right. Below is one of the solutions I tried, but clearly gave the wrong answer.

sentence = ¿Qué tipo es el? #in str format, received from standard open file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False

I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.

I'm out of ideas here, any suggestions?

@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using the Python unicode for the characters (u'é'). This required me to set the Python unicode encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.


回答1:


Use unicodedata.normalize on the string before checking.

Explanation

Unicode offers multiple forms to create some characters. For example, á could be represented with a single character, á, or two characters: a, then 'put a ´ on top of that'. Normalizing the string will force it to one or the other of the representations. (which representation it normalizes to depends on what you pass as the form parameter)




回答2:


I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') is incorrect. Just use a Unicode constant instead u'é'.

To handle Unicode correctly in a script, declare the script and data file encodings, and decode incoming data, and encode outgoing data. Using Unicode strings for text in the script.

Example (save script in UTF-8):

# coding: utf8
import codecs
with codecs.open('input.txt',encoding='latin-1') as f:
    sentence = f.readline()
if u'é' in sentence:
    print u'Found é'

Note that print implicitly encodes the output in the terminal encoding.



来源:https://stackoverflow.com/questions/13325753/how-to-find-accented-characters-in-a-string-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!