Question
I have this helper function that gets rid of control characters in XML text:
import unicodedata

def remove_control_characters(s):  # Remove control characters in XML text
    t = ""
    for ch in s:
        if unicodedata.category(ch)[0] == "C":
            t += " "
        if ch == "," or ch == "\"":
            t += ""
        else:
            t += ch
    return "".join(ch for ch in t if unicodedata.category(ch)[0] != "C")
I would like to know whether there is a Unicode category I can use to exclude quotation marks and commas.
Answer 1:
In Unicode, the general category of control characters is 'Cc', even though they have no name. unicodedata.category() returns the general category, as you can check for yourself in the Python console:
>>> unicodedata.category(u'\x00')
'Cc'
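The ASCII comma (U+002C) and quotation mark (U+0022) are both in category Po (Punctuation, other); Pi and Pf cover typographic opening and closing quotation marks such as “ (U+201C) and ” (U+201D). Your example tests only the first character of the returned code, so compare the full code instead:
cat = unicodedata.category(ch)
if cat in ("Cc", "Po", "Pi", "Pf"):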
Answer 2:
According to the latest Unicode data file, UnicodeData.txt, the comma and the quotation mark are both in the Po (Punctuation, other) category:
002C;COMMA;Po;0;CS;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
So, based on your question, your code should be something like this:
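import unicodedata

def remove_control_characters(xml):
    # Drop every 'Po' character; replace control characters ('Cc') with a space.
    o = [c if unicodedata.category(c) != 'Cc' else ' '
         for c in xml if unicodedata.category(c) != 'Po']
    return "".join(o)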
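One caveat with filtering on Po: that category also contains periods, exclamation marks, colons, and other punctuation, so the filter above removes more than commas and quotation marks. Here is a minimal sketch, assuming you only want to drop those two characters and turn control characters (Cc, which includes tabs and newlines) into spaces. The names `clean_xml_text` and `DROP` are hypothetical, chosen for illustration:

```python
import unicodedata

# Hypothetical set of characters to delete outright.
DROP = {",", '"'}

def clean_xml_text(s):
    """Replace 'Cc' control characters with a space and delete characters in DROP."""
    out = []
    for ch in s:
        if ch in DROP:
            continue  # delete commas and ASCII quotation marks
        if unicodedata.category(ch) == "Cc":
            out.append(" ")  # control character becomes a space
        else:
            out.append(ch)  # everything else, including other 'Po' characters, is kept
    return "".join(out)

print(clean_xml_text('a,b\x00"c".'))  # prints: ab c.
```

Testing explicit characters rather than a whole category keeps the trailing period in the example, which a Po filter would have removed.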
If you want to find the category of any other Unicode character without consulting the UnicodeData.txt file, you can simply print it:
print(c, unicodedata.category(c))
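To check several characters at once, a short loop is enough. This is a sketch using only the standard `unicodedata` module; the sample characters and the `samples` and `categories` names are arbitrary choices for illustration:

```python
import unicodedata

# Arbitrary sample: comma, ASCII quotation mark, curly quotes, period, NUL.
samples = [",", '"', "\u201c", "\u201d", ".", "\x00"]

# Map each character to its general category code.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    # Print the code point rather than the character, since NUL is not printable.
    print("U+%04X" % ord(ch), cat)
```

The comma, ASCII quotation mark, and period report Po, the curly quotes report Pi and Pf, and NUL reports Cc.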
Source: https://stackoverflow.com/questions/33565552/unicode-category-for-commas-and-quotation-marks