I tried to remove the emoji from a Unicode tweet text and print the result in Python 2.7 using:
    myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
    print myre.sub('', text)
but it seems almost all the characters are removed from the text. I have checked several answers from other posts, but unfortunately none of them work here. Did I do anything wrong in re.compile()?
Here is an example of the output, with nearly all the characters removed:
“ ' //./” ! # # # …
You are not using the correct notation for non-BMP Unicode code points; you want to use \U0001FFFF, a capital U and 8 digits:
    myre = re.compile(u'['
        u'\U0001F300-\U0001F5FF'
        u'\U0001F600-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
This can be reduced to:
    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
as your first two ranges are adjacent.
Your version was specifying (with added spaces for readability):
[\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+
That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.
The largest of those ranges is 0-\u1F6F (so from the digit 0 through to Ὧ, U+1F6F), which covers a very large swathe of the Unicode standard, including all ASCII digits and letters.
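To see the misparse in action, here is a small sketch that compiles the question's character class unchanged and applies it to a made-up sample string (the names `short_escape`, `broken`, `sample` and `stripped` are only for illustration). It is written to run on Python 2.7 and Python 3 alike:

```python
import re

# \u consumes exactly four hex digits, so the fifth digit becomes
# a separate literal character: u'\u1F300' is U+1F30 followed by '0'.
short_escape = u'\u1F300'
print(len(short_escape))  # two characters, not one

# The character class from the question, unchanged.
broken = re.compile(
    u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

# The accidental range '0'-U+1F6F matches ASCII digits and letters,
# while characters below U+0030 (space, comma, '!') are left alone.
sample = u'Hello, world! 123'
stripped = broken.sub(u'', sample)
print(repr(stripped))
```

Only the comma, the exclamation mark and the spaces survive, which is the same pattern as the example output in the question.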
The corrected expression works, provided you use a UCS-4 wide Python executable:
    >>> import re
    >>> myre = re.compile(u'['
    ...     u'\U0001F300-\U0001F64F'
    ...     u'\U0001F680-\U0001F6FF'
    ...     u'\u2600-\u26FF\u2700-\u27BF]+',
    ...     re.UNICODE)
    >>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
    u'Some example text with a sleepy face: '
The UCS-2 equivalent is:
    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)
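The surrogate values in the UCS-2 pattern follow from the standard UTF-16 encoding of each range boundary. As a rough illustration, the helper below (`surrogate_pair` is a hypothetical name, not part of any library) computes the pair for a code point above U+FFFF:

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

# Boundaries of the ranges used in the wide-build pattern.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = surrogate_pair(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 through U+1F3FF share the high surrogate \ud83c, and U+1F400 onwards use \ud83d, which is why the narrow-build pattern needs one alternative per high surrogate.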
You can combine the two in your script with an exception handler:
    try:
        # Wide UCS-4 build
        myre = re.compile(u'['
            u'\U0001F300-\U0001F64F'
            u'\U0001F680-\U0001F6FF'
            u'\u2600-\u26FF\u2700-\u27BF]+',
            re.UNICODE)
    except re.error:
        # Narrow UCS-2 build
        myre = re.compile(u'('
            u'\ud83c[\udf00-\udfff]|'
            u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
            u'[\u2600-\u26FF\u2700-\u27BF])+',
            re.UNICODE)
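As an alternative sketch (not what the answer above uses), the build type can also be detected up front with `sys.maxunicode`, which is 0xFFFF on narrow builds and 0x10FFFF on wide builds and on Python 3.3 or later. The names `EMOJI_RE` and `remove_emoji` are made up for this example:

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide UCS-4 build, or Python 3.3+: astral code points are single characters.
    EMOJI_RE = re.compile(
        u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+',
        re.UNICODE)
else:
    # Narrow UCS-2 build: astral code points are stored as surrogate pairs.
    EMOJI_RE = re.compile(
        u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+',
        re.UNICODE)


def remove_emoji(text):
    """Strip the emoji ranges covered by EMOJI_RE from a unicode string."""
    return EMOJI_RE.sub(u'', text)


print(repr(remove_emoji(u'Some example text with a sleepy face: \U0001f62a')))
```

Checking `sys.maxunicode` makes the choice explicit, whereas the exception handler relies on the wide pattern failing to compile on a narrow build; either approach should select the same pattern.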