Question
I have a regular expression to detect emojis:
emoji = u'(\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]|[\u2600-\u26FF\u2700-\u27BF])'
and I test with this command:
re.match(emoji, u'\U0001f602', re.UNICODE) # "😂"
The problem is that it finds a match on my macOS machine, but not on Linux Debian.
Using IPython 4.0.1 and Python 2.7.11, both from the conda distribution.
Why does the match fail on Linux?
Answer 1:
Your macOS machine has a narrow Python build. Try this on it:
unichr(0x0001f602)
I expect you'll get a ValueError. It means that your macOS Python install treats Unicode characters above U+FFFF as two code units (a UTF-16 surrogate pair) rather than one character.
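A quick way to check which kind of build you are running is `sys.maxunicode`, which is standard in both Python 2 and 3; this sketch just reports the build type:

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (UTF-16) build and 0x10FFFF on a
# wide (UCS-4) build; Python 3.3+ always reports the wide value.
if sys.maxunicode == 0xFFFF:
    print("narrow build: astral characters are stored as surrogate pairs")
else:
    print("wide build: one code point per character")
```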
>>> u'\ud83d\ude02'.encode('utf8')
'\xf0\x9f\x98\x82'
>>> u'\U0001f602'.encode('utf8')
'\xf0\x9f\x98\x82'
>>> re.match(emoji, u'\ud83d\ude02', re.UNICODE)
<_sre.SRE_Match object at 0x7fdf7405d6c0>
Notice how \ud83d\ude02 and \U0001f602 encode to the same bytes. Your narrow macOS build stores the character \U0001f602 as the surrogate pair \ud83d\ude02, which matches your regex. The wide build on Linux stores it as the single code point U+1F602, which doesn't fall in any of the ranges in your regex.
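You can observe the two representations directly by measuring the string length, since `len()` counts code units, not user-perceived characters:

```python
# One code unit on a wide build; two (the surrogate pair \ud83d\ude02)
# on a narrow build such as the asker's macOS Python.
s = u'\U0001f602'
print(len(s))  # 1 on a wide build, 2 on a narrow build
```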
Your options are: 1) add the following range of characters to your regex under Linux:
ur'[\U0001F600-\U0001F64F]'
But it will break the regex under macOS: on a narrow build the \U escapes decompose into surrogate pairs, and re rejects the resulting character class with a "bad character range" error.
2) Switch to Python 3, which (since 3.3) always treats code points above U+FFFF as single characters.
3) Rebuild your Python on the Mac with the --enable-unicode=ucs4 configure option, so it is a wide build too.
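If you must support both builds with one script, a hedged sketch is to choose the pattern at runtime based on `sys.maxunicode` (the ranges below are the ones from the question plus the range from option 1; they are illustrative, not a complete emoji list):

```python
import re
import sys

if sys.maxunicode == 0x10FFFF:
    # Wide build (and all Python 3.3+): match astral code points directly.
    emoji = u'([\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]|[\u2600-\u27BF])'
else:
    # Narrow build: match the UTF-16 surrogate pairs instead.
    emoji = u'(\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]|[\u2600-\u27BF])'

print(bool(re.match(emoji, u'\U0001f602', re.UNICODE)))  # True on either build
```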
Source: https://stackoverflow.com/questions/34679514/emoji-not-detected-with-python-regular-expression-in-linux