Question
I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.
I've tried my best to accomplish this, but have had no luck getting it right. Below is one of the solutions I tried, but it clearly gave the wrong answer.
sentence = '¿Qué tipo es el?'  # stands in for the byte str read from the file with the standard open()
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
# prints: False
I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work either.
I'm out of ideas here. Any suggestions?
@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then used Python unicode literals for the characters (u'é'). This required me to declare the source encoding at the top of the script. The final step was to use the unicodedata.normalize function to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.
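The steps described in this update can be sketched roughly as follows. This is only an illustration written for Python 3 (where str is already Unicode), whereas the thread itself uses Python 2. The names SPANISH_MARKERS and looks_spanish are hypothetical, and the marker set is an assumption about which characters are worth checking.

```python
import unicodedata

# Assumed marker set: characters common in Spanish but rare in English text.
# Normalized to NFC so the markers are composed whatever form the editor saved.
SPANISH_MARKERS = set(unicodedata.normalize("NFC", "áéíóúüñÁÉÍÓÚÜÑ¿¡"))


def looks_spanish(text):
    """Return True if the text contains any of the marker characters."""
    # NFC folds 'e' + combining acute accent into the single character 'é',
    # so composed and decomposed input both match the markers above.
    composed = unicodedata.normalize("NFC", text)
    return any(ch in SPANISH_MARKERS for ch in composed)


# Bytes as they would arrive from a latin-1 encoded file.
raw = "\u00bfQu\u00e9 tipo es el?".encode("latin-1")
sentence = raw.decode("latin-1")

print(looks_spanish(sentence))
print(looks_spanish("What kind is it?"))
print(looks_spanish("Que\u0301 tal"))  # decomposed form of 'é'
```

Normalizing both the markers and the input to the same form is what makes the membership test reliable. A check like this only detects the presence of such characters, so a Spanish sentence without any of them would not be flagged.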
Answer 1:
Use unicodedata.normalize on the string before checking.
Explanation
Unicode offers multiple ways to represent some characters. For example, á could be represented as a single character, á, or as two characters: a, followed by a combining character meaning 'put a ´ on top of that'. Normalizing the string forces it into one representation or the other. Which representation it normalizes to depends on what you pass as the form parameter (for example, 'NFC' for the composed form or 'NFD' for the decomposed form).
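To make the two representations concrete, here is a small sketch, assuming Python 3. The variable names are only illustrative. It builds á both ways and shows that the strings differ until they are normalized to the same form.

```python
import unicodedata

composed = "\u00e1"      # 'á' as a single code point
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
print(composed == decomposed)
print(len(composed), len(decomposed))

# NFC composes, NFD decomposes; after normalizing, the strings compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)
print(nfd == decomposed)
```

Either form works for comparison as long as both strings are normalized to the same one.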
Answer 2:
I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') is incorrect. Just use a Unicode constant instead: u'é'.
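The mismatch can be reproduced with a short sketch, assuming Python 3, where the byte string has to be built explicitly with encode(). It assumes, as suspected above, that the 'é' reached the script as UTF-8 bytes. The variable names are illustrative.

```python
# 'é' as the bytes a UTF-8 terminal or source file would supply.
utf8_bytes = "\u00e9".encode("utf-8")

# Decoding those bytes as latin-1 turns each byte into its own character.
wrong = utf8_bytes.decode("latin-1")
right = utf8_bytes.decode("utf-8")

sentence = "\u00bfQu\u00e9 tipo es el?"  # the correctly decoded sentence

print(len(utf8_bytes))    # UTF-8 uses two bytes for 'é'
print(repr(wrong))        # two unrelated characters instead of 'é'
print(wrong in sentence)
print(right in sentence)
```

Because the wrongly decoded value is two characters that never occur in the sentence, the membership test fails even though the sentence itself was decoded correctly.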
To handle Unicode correctly in a script, declare the script and data file encodings, decode incoming data, and encode outgoing data. Use Unicode strings for text within the script.
Example (save script in UTF-8):
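# coding: utf8
import codecs

# Decode the latin-1 file into unicode strings as it is read.
with codecs.open('input.txt', encoding='latin-1') as f:
    sentence = f.readline()

# Compare against a unicode literal, not a byte string.
if u'é' in sentence:
    print u'Found é'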
Note that print implicitly encodes the output in the terminal encoding.
Source: https://stackoverflow.com/questions/13325753/how-to-find-accented-characters-in-a-string-in-python