Remove all symbols while preserving string consistency [duplicate]

Submitted by 夙愿已清 on 2019-12-13 06:38:55

Question


My goal is to remove all symbols from a string while preserving the Unicode letters (alphabetic characters from any language). Suppose I have the following string:

carbon copolymers—III❏£\n12- Géotechnique\n

I want to remove the —, ❏ and £ characters between copolymers and \n. I came across a related post and thought maybe I should use a regex to remove all symbols, given the correct Unicode character ranges. The characters in my text file range from Latin to Russian and beyond. However, the regex code I've written below doesn't help.

>>> import re
>>> s = u'carbon copolymers—III❏£\n12- Géotechnique\n'
>>> re.sub(ur'[^\u0020-\u00FF\n]+',' ', s)

There seem to be two problems with this method:

1) Different Unicode ranges still include some symbols.

2) Sometimes, for reasons I don't understand, the returned result is completely different from what it is supposed to be.

Here's the result of the code above:

carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n
>>> print u'carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n'
carbon copolymersâIII
12- Géotechnique 
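
Both problems can be reproduced in a short sketch, written here for Python 3 rather than the Python 2 used in the question. The first part shows that the range U+0020 to U+00FF still contains symbols such as £. The second part rests on an assumption the question does not confirm: the sequences \xe2\x80\x94 and \xc3\xa9 in the output above are the UTF-8 bytes of the dash and of é, which suggests the text was at some point read as Latin-1 instead of UTF-8. This does not account for every detail of the output shown.

```python
import re

s = 'carbon copolymers—III❏£\n12- Géotechnique\n'

# Problem 1: the range U+0020..U+00FF keeps Latin-1 symbols such as '£' (U+00A3).
filtered = re.sub(r'[^\u0020-\u00FF\n]+', ' ', s)
print(repr(filtered))

# Problem 2 (assumed cause): UTF-8 bytes interpreted as Latin-1 code points.
garbled = s.encode('utf-8').decode('latin-1')
print(repr(garbled))
```

If that assumption holds, every garbled character falls inside U+0020 to U+00FF, so the range-based regex can no longer tell symbols from letters.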

Do you know a better way of doing this? Is there a full list of all symbols? Do you have any ideas other than regex?

Thank you


Answer 1:


I think I found a good solution (more than 99% robust, I believe) to the problem.

Well, here's our new, horrific string:

s = u'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः׶ Ǣ ܊ܔ ۩۝۞ء܅۵Géotechnique▣ऀ\n'

And here's the resulting string:

u'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

All the remaining characters/words are in fact alphabetic characters from different languages. Done with almost no effort!
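
That claim can be checked with the standard unicodedata module, which reports the Unicode general category of each character. Letter categories all start with 'L', while symbols start with 'S' and punctuation with 'P'. The sketch below assumes Python 3, and the variable names are illustrative only.

```python
import unicodedata

# The resulting string from the answer, written with escapes for the non-ASCII letters.
result = 'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

# Map every non-whitespace character to its Unicode general category.
categories = {c: unicodedata.category(c) for c in result if not c.isspace()}

# Anything whose category does not start with 'L' would not be a letter.
non_letters = {c: cat for c, cat in categories.items() if not cat.startswith('L')}
print(non_letters)
```

The same categories offer a partial answer to the question about a full list of symbols: rather than enumerating ranges by hand, a filter can test whether a character's category starts with 'S' or 'P'.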

Here's the solution:

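import re

# Replace everything that is neither a letter nor whitespace with a space.
s = ''.join([c if c.isalpha() or c.isspace() else ' ' for c in s])
# Blank out ASCII and Latin-1 punctuation, digits and symbols (ur'' is Python 2 only).
s = re.sub(ur'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', s)
# Collapse runs of spaces into a single space.
s = re.sub(r'[ ]+', ' ', s)

Printing s then gives:
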
carbon copolymers ٯ III  
Ǣ ܔ ء Géotechnique  
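
The code above depends on Python 2's ur'' literals. A minimal Python 3 sketch of the same idea might look like the following, applied to the string from the question. The function name keep_letters and the keep_digits flag are hypothetical and not part of the original answer. The Latin-1 range regex is left out on the assumption that the isalpha filter already covers nearly all of it; the two differ for a few Latin-1 characters such as ª, µ and º, which count as letters and are therefore kept here.

```python
import re

def keep_letters(text, keep_digits=False):
    """Blank out every character that is not a letter or whitespace, then squeeze spaces.

    keep_digits is a hypothetical extra: when True, numeric characters survive as well.
    """
    wanted = str.isalnum if keep_digits else str.isalpha
    spaced = ''.join(c if wanted(c) or c.isspace() else ' ' for c in text)
    # Collapse runs of spaces; newlines are left untouched.
    return re.sub(r' +', ' ', spaced)

s = 'carbon copolymers—III❏£\n12- Géotechnique\n'
print(repr(keep_letters(s)))                    # 'carbon copolymers III \n Géotechnique\n'
print(repr(keep_letters(s, keep_digits=True)))  # 'carbon copolymers III \n12 Géotechnique\n'
```

Like the original, this drops the digits in "12-" by default, because digits are not alphabetic; pass keep_digits=True if numbers should stay.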


Source: https://stackoverflow.com/questions/34034225/remove-all-symbols-while-preserving-string-consistency
