Maintaining the consistency of strings before and after converting to ASCII

Question

I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n, containing a variety of non-ASCII characters, stored in a string variable named txtWords.

My goal is to remove all non-ASCII characters while keeping the strings consistent. For instance, I want the first line to become carbon copolymers III or carbon copolymers iii (case does not matter here), the second to become geotechnique\n, and so on.

Currently I am using the following code, but it does not produce what I expect. It turns carbon copolymers—III into carbon copolymersiii, which is definitely not what I want:

    import unicodedata, re
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)

If I use the regex code first then I get something worse (in terms of what I expect):

    import unicodedata, re
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')

This way, for the string Géotechnique\n I get otechnique!

How can I resolve this issue?

Answer 1:

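Use the \w character class, with the re.U flag, to replace non-alphanumeric characters with spaces before applying the decomposition trick. With re.U, \w matches accented letters such as é, so they survive the regex and are reduced to their base letters by the NFKD step, while the em-dash becomes a space instead of being silently dropped: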

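    # coding: utf8
    from __future__ import unicode_literals, print_function
    import unicodedata as ud
    import re

    txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
    # Replace everything except word characters and newlines with a space.
    # With re.U, \w also matches accented letters such as é.
    txtWords = re.sub(r'[^\w\n]', r' ', txtWords.lower(), flags=re.U)
    # Decompose accented letters (é -> e + combining accent), then drop non-ASCII.
    txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
    print(txtWords)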

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
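
For reuse, the same two steps could be wrapped in a small function. The following is only a sketch for Python 3, where str patterns are Unicode-aware by default and the re.U flag is not needed; the function name to_ascii_words is made up for illustration and is not part of the original answer. It also replays the order used in the question, to show where the word boundary gets lost.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Lowercase text, turn punctuation into spaces, and strip accents.

    Hypothetical helper name, chosen for this sketch only.
    """
    # In Python 3, \w in a str pattern matches accented letters such as 'é'.
    spaced = re.sub(r'[^\w\n]', ' ', text.lower())
    # NFKD splits 'é' into 'e' plus a combining accent (U+0301).
    decomposed = unicodedata.normalize('NFKD', spaced)
    # Dropping non-ASCII code points removes the accent and keeps the base letter.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = 'carbon copolymers—III\n12- Géotechnique\n'
result = to_ascii_words(sample)
print(repr(result))  # 'carbon copolymers iii\n12  geotechnique\n'

# Order used in the question: the em-dash has no ASCII decomposition,
# so encode() deletes it before any regex can turn it into a space.
lost = (unicodedata.normalize('NFKD', sample.lower())
        .encode('ascii', 'ignore')
        .decode('ascii'))
print(repr(lost))  # 'carbon copolymersiii\n12- geotechnique\n'
```

One caveat with this approach: letters that have no Unicode decomposition, such as ø or ß, are not reduced to a base letter and are removed entirely by the encode step.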

Source: https://stackoverflow.com/questions/33990023/maintaining-the-consistency-of-strings-before-and-after-converting-to-ascii

Tags: regex, string, python-2.7, unicode, consistency