python-re.sub() and unicode

Submitted by 孤者浪人 on 2021-02-10 06:06:35

Question


I want to replace all emoji with '', but my regex doesn't work.
For example,

content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'

I want to replace every sequence like \U0001f633 with '', so I wrote this code:

print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)

But it doesn't work.
Thanks a lot.


Answer 1:


You won't be able to recognize properly decoded Unicode code points that way (as strings containing \uXXXX, etc.). Once properly decoded, each one is a single character by the time the regex parser gets to it, so there is no backslash or hex digit left to match.
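To make that concrete, here is a minimal sketch for Python 3 (the variable names are illustrative and not from the question). It contrasts a string holding the decoded emoji with a raw string holding the literal escape text:

```python
import re

# The question's pattern: a literal backslash, a 'U', then 8 hex digits.
literal_pattern = re.compile(r'\\U[0-9a-fA-F]{8}')

# Normal string: the escape is decoded, so this holds the emoji character itself.
decoded = 'abc\U0001f633'

# Raw string: the escape is NOT decoded, so this holds the 10 characters \U0001f633.
escaped_text = r'abc\U0001f633'

no_match = literal_pattern.search(decoded)    # None: there is no backslash to find
match = literal_pattern.search(escaped_text)  # matches the literal escape text
```

The pattern from the question would only be useful on text like `escaped_text`, where the escape sequences were never decoded in the first place.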

Depending on whether your Python was compiled with only 16-bit Unicode code points (a "narrow" build) or not, you'll want a pattern something like one of these:

# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

And your code would look like:

import re

# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)

stripped = re_strip.sub('', content)
print(stripped)

Both expressions reduce the stripped string to 26 characters.

These expressions strip out the emoji you were after, but they match every character outside the Basic Multilingual Plane, so they may also strip out other things you do want. It may be worth reviewing a Unicode code point range listing and narrowing the ranges.
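As a rough sketch of what a narrower pattern might look like on Python 3, the ranges below are illustrative assumptions only, not a complete definition of "emoji". The sample string includes U+20000, a CJK Extension B ideograph, to show a non-emoji character that the broad pattern removes:

```python
import re

# Illustrative ranges only (an assumption, not a complete emoji definition):
#   U+1F300-U+1FAFF  pictograph, emoticon, transport and related symbol blocks
#   U+2600-U+27BF    Miscellaneous Symbols and Dingbats
narrow_strip = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')

# The broad pattern from the answer: everything outside the BMP.
broad_strip = re.compile('[\U00010000-\U0010FFFF]')

# U+20000 is outside the BMP but is an ideograph, not an emoji.
sample = '\u86cb\u767d\U00020000\U0001f633'

narrow_result = narrow_strip.sub('', sample)  # keeps U+20000, drops the emoji
broad_result = broad_strip.sub('', sample)    # drops both
```

Which ranges to include depends on the text being cleaned, so treat the character class above as a starting point to adjust.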

You can determine whether your Python install only recognizes 16-bit code points by doing something like:

import sys
print(sys.maxunicode.bit_length())

If this displays 16, you'll need the first regular expression. If it displays something greater than 16 (for me it says 21), the second one is what you want. On Python 3.3 and later, sys.maxunicode is always 0x10FFFF, so this always displays 21.

Neither expression will work on a Python install with the other sys.maxunicode value.
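One way to avoid picking the wrong pattern by hand is to choose it at runtime from sys.maxunicode. This is only a sketch, and build_strip_pattern is a hypothetical helper name, not something from the original answer:

```python
import re
import sys


def build_strip_pattern():
    """Return a compiled pattern matching characters outside the BMP.

    Hypothetical helper: selects the pattern that fits this interpreter.
    """
    if sys.maxunicode > 0xFFFF:
        # Wide build (always the case on Python 3.3+): one character per code point.
        return re.compile(u'[\U00010000-\U0010FFFF]')
    # Narrow build: such characters are stored as surrogate pairs.
    return re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


content = u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
stripped = build_strip_pattern().sub(u'', content)
print(len(stripped))
```

On a wide build this should leave the same 26-character string that the answer's own code produces.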




Source: https://stackoverflow.com/questions/38681921/python-re-sub-and-unicode
