Question
I'm trying to remove the punctuation from a unicode string, which may contain non-ascii letters. I tried using the regex module:
import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)
However, I've noticed that the characters < and > don't get removed. Does anyone know why and is there any other way to strip punctuation from unicode strings?
EDIT: Another approach I've tried out is doing:
import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")
but I would like to avoid converting the text from unicode to string and backwards.
Answer 1:
< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:
regex.sub(r'[\p{P}\p{Sm}]+', '', text)
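To check which category a given character falls into, you can ask the standard unicodedata module directly. This is a minimal sketch, assuming Python 3; the helper name categories is made up for illustration:

```python
import unicodedata

def categories(text):
    """Return (character, general category) pairs for each character in text."""
    return [(ch, unicodedata.category(ch)) for ch in text]

# '<' and '>' report Sm (math symbol), '!' reports Po (other punctuation),
# and the letters report Lu / Ll.
for ch, cat in categories("<Üäik>!"):
    print(repr(ch), cat)
```

Any category starting with P is what \p{P} matches, while Sm is the separate math-symbol category.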
Alternatively, the unicode.translate() method takes a dictionary that maps integer code points to other code points, to unicode strings, or to None; mapping a code point to None removes it. Map string.punctuation to code points with ord():
text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
That removes only the limited set of ASCII punctuation characters.
Demo:
>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm code points by iterating from 0 to sys.maxunicode and testing each value against unicodedata.category():
>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik
(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
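For reference, here is how the same idea might look on Python 3 with only the standard library. Treat it as a sketch rather than a definitive implementation; the names build_removal_table, REMOVE and strip_punctuation are hypothetical:

```python
import sys
import unicodedata

def build_removal_table(prefixes=("P", "Sm")):
    """Map every code point whose category starts with one of `prefixes` to None."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )

# Scanning the whole code point range is slow-ish, so build the table once.
REMOVE = build_removal_table()

def strip_punctuation(text):
    """Delete all punctuation (P*) and math symbol (Sm) characters from text."""
    return text.translate(REMOVE)

print(strip_punctuation("<Üäik>"))             # Üäik
print(strip_punctuation("«Hola», ¿qué tal?"))  # Hola qué tal
```

Building the table walks every code point, so it is worth keeping at module level and reusing it for each call.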
Answer 2:
Try the string module together with re:
import re, string
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)
This prints:
Üäik
<type 'unicode'>
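Keep in mind that string.punctuation only covers ASCII, so non-ASCII punctuation is left in place by this approach. Here is a small sketch of that limitation, assuming Python 3; the names ascii_punct and strip_ascii_punctuation are illustrative only:

```python
import re
import string

# Character class built from the ASCII punctuation characters only.
ascii_punct = re.compile("[%s]" % re.escape(string.punctuation))

def strip_ascii_punctuation(text):
    """Remove ASCII punctuation; anything outside ASCII is left untouched."""
    return ascii_punct.sub("", text)

print(strip_ascii_punctuation("<Üäik>"))  # Üäik
print(strip_ascii_punctuation("«Üäik»"))  # «Üäik»  (guillemets are not ASCII)
```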
Answer 3:
\p{P} matches punctuation characters only.
Among the ASCII characters, those are
! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
< and > are not punctuation characters (Unicode classifies them as math symbols), so they won't be removed.
Try this instead:
regex.sub(r'[\p{P}<>]+', '', text)
Source: https://stackoverflow.com/questions/33787354/strip-special-characters-and-punctuation-from-a-unicode-string