Remove Unicode emoji using re in Python

Anonymous (unverified), submitted 2019-12-03 02:51:02

Question:

I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using:

    myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
    print myre.sub('', text)

but almost all of the characters are removed from the text. I have checked several answers from other posts; unfortunately, none of them work here. Did I do anything wrong in re.compile()?

Here is an example of the output, with almost all of the characters removed:

“   '   //./” ! # # # … 

Answer 1:

You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, a capital U and 8 hex digits:

    myre = re.compile(u'['
        u'\U0001F300-\U0001F5FF'
        u'\U0001F600-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)

This can be reduced to:

    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)

as your first two ranges are adjacent.
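A quick way to confirm that the merge is safe is to check that the two ranges really are adjacent and that both character classes accept the same code points. The sketch below assumes Python 3.3 or later (where every str is full Unicode, so the u prefix and re.UNICODE are unnecessary); the names separate, merged and sample are illustrative only and do not come from the answer.

```python
import re

# Character class with the first two ranges kept separate.
separate = re.compile(
    '['
    '\U0001F300-\U0001F5FF'
    '\U0001F600-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]+'
)

# Character class with the first two ranges merged into one.
merged = re.compile(
    '['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]+'
)

# U+1F5FF is immediately followed by U+1F600, so merging gains or loses nothing.
ranges_adjacent = (0x1F5FF + 1 == 0x1F600)

# Compare both patterns over the code points surrounding the ranges of interest.
sample = list(range(0x2500, 0x2900)) + list(range(0x1F000, 0x1F800))
same_matches = all(
    bool(separate.fullmatch(chr(cp))) == bool(merged.fullmatch(chr(cp)))
    for cp in sample
)

print(ranges_adjacent, same_matches)
```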

Your version was specifying (with added spaces for readability):

[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+ 

That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.
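The difference between the two escapes is easy to see by inspecting the resulting strings. This is a small sketch assuming Python 3.3 or later, where str literals follow the same \u and \U escape rules as Python 2 unicode literals; the variable names are illustrative.

```python
# \u consumes exactly 4 hex digits, so the trailing '0' is a separate character.
four_digit = '\u1F300'
# \U consumes exactly 8 hex digits and yields the single code point U+1F300.
eight_digit = '\U0001F300'

print(len(four_digit), [hex(ord(c)) for c in four_digit])    # 2 ['0x1f30', '0x30']
print(len(eight_digit), [hex(ord(c)) for c in eight_digit])  # 1 ['0x1f300']
```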

The largest of those ranges is 0-\u1F6F (so from the digit 0 through to U+1F6F, a character in the Greek Extended block), which covers a very large swathe of the Unicode standard, including every ASCII digit and letter.
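To see the effect of that oversized range, the pattern from the question can be run against a short ASCII string. This is a sketch assuming Python 3.3 or later; the sample text is hypothetical. Only characters below the digit 0 (spaces and some punctuation) survive, which matches the kind of output shown in the question.

```python
import re

# The pattern from the question, unchanged: each \uXXXX consumes 4 hex digits only,
# so the class contains the ranges 0-\u1F5F, 0-\u1F64 and 0-\u1F6F.
broken = re.compile(
    '[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+'
)

text = 'Hello, world! #tag'   # hypothetical sample input
result = broken.sub('', text)
print(repr(result))           # ', ! #'
```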

The corrected expression works, provided you use a UCS-4 wide Python executable:

    >>> import re
    >>> myre = re.compile(u'['
    ...     u'\U0001F300-\U0001F64F'
    ...     u'\U0001F680-\U0001F6FF'
    ...     u'\u2600-\u26FF\u2700-\u27BF]+',
    ...     re.UNICODE)
    >>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
    u'Some example text with a sleepy face: '

The UCS-2 equivalent is:

    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
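The surrogate values in the UCS-2 pattern follow from the standard UTF-16 encoding of each range boundary. The sketch below shows that arithmetic; to_surrogates is a hypothetical helper written for illustration, not part of the re module or of the original answer.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low

# Boundaries of the non-BMP ranges, split where the high surrogate changes.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogates(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 through U+1F3FF share the high surrogate \ud83c, while U+1F400 through U+1F64F and U+1F680 through U+1F6FF share \ud83d, which is why the pattern needs two alternatives for the non-BMP ranges.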

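You can combine the two in your script with an exception handler. On a narrow build the first pattern fails to compile with re.error (each \U escape becomes a surrogate pair, which leaves an invalid range inside the character class), so the fallback is used: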

    import re

    try:
        # Wide UCS-4 build
        myre = re.compile(u'['
            u'\U0001F300-\U0001F64F'
            u'\U0001F680-\U0001F6FF'
            u'\u2600-\u26FF\u2700-\u27BF]+',
            re.UNICODE)
    except re.error:
        # Narrow UCS-2 build
        myre = re.compile(u'('
            u'\ud83c[\udf00-\udfff]|'
            u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
            u'[\u2600-\u26FF\u2700-\u27BF])+',
            re.UNICODE)
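On Python 3.3 and later the narrow/wide build distinction no longer exists, so the wide-build pattern should work on its own and the fallback is not needed. A minimal sketch under that assumption follows; EMOJI_RE and remove_emoji are hypothetical names chosen for illustration, and the pattern covers only the ranges listed in the answer.

```python
import re

# Python 3.3+ (assumed): no u prefix needed, and str patterns are Unicode by default.
EMOJI_RE = re.compile(
    '['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]+'
)

def remove_emoji(text):
    """Return text with every run of characters in the ranges above removed."""
    return EMOJI_RE.sub('', text)

cleaned = remove_emoji('Some example text with a sleepy face: \U0001f62a')
print(repr(cleaned))   # 'Some example text with a sleepy face: '
```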

