I'm trying to remove the punctuation from a Unicode string, which may contain non-ASCII letters. I tried using the regex module:
import regex
text
\p{P} matches punctuation characters.
In the ASCII range, those punctuation characters are
! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
< and > are not punctuation characters. So they won't be removed.
Try this instead, which adds < and > to the character class explicitly:
regex.sub('[\p{P}<>]+', "", text)
< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:
regex.sub('[\p{P}\p{Sm}]+', '', text)
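One way to confirm these classifications is to ask the standard unicodedata module directly. The snippet below is a minimal sketch in Python 3 syntax; the sample characters are picked only for illustration.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = ["<", ">", "!", ",", "+"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(ch, cat)
```

Categories starting with P are punctuation; Sm is the math-symbol category that \p{Sm} matches.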
The unicode.translate() method works too. It takes a dictionary mapping integer codepoints to other integer codepoints, to a unicode string, or to None; mapping a codepoint to None removes it. Map string.punctuation to codepoints with ord():
text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
That removes only the limited set of ASCII punctuation characters.
Demo:
>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
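The limitation of string.punctuation shows up as soon as the text contains non-ASCII punctuation. Here is a small sketch in Python 3 syntax, where str.translate() plays the role of unicode.translate(); the sample string with guillemets is an assumption made for illustration.

```python
import string

# Hypothetical sample: guillemets are punctuation, but not ASCII.
sample = "«Üäik», <ok>!"

# Mapping each ASCII punctuation codepoint to None deletes it.
ascii_punct = dict.fromkeys(ord(c) for c in string.punctuation)

cleaned = sample.translate(ascii_punct)
print(cleaned)
```

The comma, angle brackets and exclamation mark are removed, but « and » survive because they are not in string.punctuation.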
If string.punctuation is not enough, then you can generate a complete str.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode, then test those values against unicodedata.category():
>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik
(For Python 3, replace unicode with str, unichr with chr, and print ... with print(...).)
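In Python 3 syntax, the same idea might look like the sketch below. The names build_removal_table, REMOVAL_TABLE and strip_punctuation are hypothetical helpers, not part of any library. Building the table walks every codepoint once, so it is probably worth doing only once per program.

```python
import sys
import unicodedata


def build_removal_table():
    """Map every punctuation (P*) and math-symbol (Sm) codepoint to None."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(("P", "Sm"))
    )


# Built once at import time and reused for every call.
REMOVAL_TABLE = build_removal_table()


def strip_punctuation(text):
    """Return text with all P* and Sm characters deleted."""
    return text.translate(REMOVAL_TABLE)


print(strip_punctuation("<Üäik>"))
```

Characters in other symbol categories, such as the currency sign $, are left alone by this table; extend the tuple passed to startswith() if those should go too.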
Try the string module:
import string, re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)
This prints:
Üäik
<type 'unicode'>