Strip special characters and punctuation from a unicode string

Submitted by 放荡痞女 on 2020-01-30 11:30:48

Question


I'm trying to remove the punctuation from a unicode string, which may contain non-ASCII letters. I tried using the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

However, I've noticed that the characters < and > don't get removed. Does anyone know why, and is there another way to strip punctuation from unicode strings?

EDIT: Another approach I've tried out is doing:

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

but I would like to avoid converting the text from unicode to string and backwards.


Answer 1:


< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:

regex.sub(r'[\p{P}\p{Sm}]+', '', text)
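
To see why \p{P} skips the angle brackets, the general category of each character can be looked up with the standard unicodedata module. This is a minimal Python 3 sketch; the variable name categories is only illustrative.

```python
import unicodedata

# Look up the Unicode general category of each character in the sample string.
text = "<Üäik>"
categories = {ch: unicodedata.category(ch) for ch in text}

for ch, cat in categories.items():
    # '<' and '>' report Sm (math symbol); the letters report Lu or Ll.
    print(repr(ch), cat)
```

None of the characters here has a category starting with P, which is why a pattern limited to \p{P} leaves the string unchanged.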

The unicode.translate() method exists too and takes a dictionary mapping integer numbers (codepoints) to either other integer codepoints, a unicode character, or None; None removes that codepoint. Map string.punctuation to codepoints with ord():

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

That removes only the limited set of ASCII punctuation characters.
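
A short Python 3 sketch of that limitation, using a made-up sample string (an assumption, not from the question) that mixes ASCII and non-ASCII punctuation:

```python
import string

# Translation table that deletes every ASCII punctuation character.
ascii_table = dict.fromkeys(ord(c) for c in string.punctuation)

# Hypothetical sample: guillemets, curly quotes and the ellipsis are not ASCII.
sample = "«Üäik», “quoted”… <done>!"
result = sample.translate(ascii_table)
print(result)  # «Üäik» “quoted”… done
```

The comma, the angle brackets and the exclamation mark are gone, while the non-ASCII punctuation is left in place.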

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik

If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...)).
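
In Python 3 the same idea might look like the sketch below. The names build_removal_table, REMOVE_PUNCT and strip_punctuation are illustrative and not part of the answer; the table is built once because scanning every codepoint takes a noticeable moment.

```python
import sys
import unicodedata


def build_removal_table(prefixes=("P", "Sm")):
    """Map every codepoint whose category starts with one of `prefixes` to None."""
    return dict.fromkeys(
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefixes)
    )


# Build the table a single time and reuse it for every call.
REMOVE_PUNCT = build_removal_table()


def strip_punctuation(text):
    """Delete all punctuation (P*) and math symbol (Sm) characters from text."""
    return text.translate(REMOVE_PUNCT)


print(strip_punctuation("<Üäik>"))                   # Üäik
print(strip_punctuation("«Üäik», “quoted”… done!"))  # Üäik quoted done
```

Passing a different prefixes tuple, for example ("P", "S"), would widen the table to all symbol categories; whether that is desirable depends on the text being cleaned.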




Answer 2:


Try the string module. Note that string.punctuation only covers ASCII punctuation:

import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)

Prints:

Üäik
<type 'unicode'>



Answer 3:


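\p{P} matches only characters in the Unicode Punctuation (P) categories.

Among the ASCII characters, those are

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

< and > are not in that set (they are math symbols), so they won't be removed.

Add them to the character class explicitly, using the regex module (the standard re module does not support \p{...}):

regex.sub(r'[\p{P}<>]+', '', text)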
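
To check which ASCII punctuation characters actually fall under \p{P}, string.punctuation can be split by Unicode category with the standard library alone. This is a Python 3 sketch; the names in_p and not_in_p are illustrative.

```python
import string
import unicodedata

# ASCII punctuation whose general category starts with "P" (matched by \p{P}).
in_p = [c for c in string.punctuation if unicodedata.category(c).startswith("P")]

# The rest, with their categories: currency (Sc), math (Sm) and modifier (Sk) symbols.
not_in_p = {
    c: unicodedata.category(c)
    for c in string.punctuation
    if not unicodedata.category(c).startswith("P")
}

print(" ".join(in_p))
print(not_in_p)
```

Of the 32 characters in string.punctuation, 23 are in a P category; the remaining nine, including < and >, are symbols and need to be listed in the pattern separately.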


Source: https://stackoverflow.com/questions/33787354/strip-special-characters-and-punctuation-from-a-unicode-string
