Emoji not detected with python regular expression in Linux

这一生的挚爱 submitted on 2019-12-22 19:52:29

Question


I have a regular expression to detect emojis:

emoji = u'(\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]|[\u2600-\u26FF\u2700-\u27BF])'

and I test it with this call: re.match(emoji, u'\U0001f602', re.UNICODE) # "😂"

The problem is that it finds a match on my macOS machine, but not on Debian Linux.

I'm using IPython 4.0.1 and Python 2.7.11, both from the conda distribution.

Why does the match fail on Linux?


Answer 1:


Your Mac OS machine has a narrow Python build. Try this on it:

unichr(0x0001f602)

I expect you'll get an exception. It means that your Mac Python install treats Unicode characters above U+FFFF as two characters (a surrogate pair).
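A quick way to check which kind of build you have (an illustrative session, not from the original answer): sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF on a wide (UCS-4) build, and len() of an astral character reflects the same split.

>>> import sys
>>> sys.maxunicode       # 65535 on a narrow build, 1114111 on a wide (UCS-4) build
>>> len(u'\U0001f602')   # 2 on a narrow build (surrogate pair), 1 on a wide build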

>>> u'\ud83d\ude02'.encode('utf8')
'\xf0\x9f\x98\x82'

>>> u'\U0001f602'.encode('utf8')
'\xf0\x9f\x98\x82'

>>> re.match(emoji, u'\ud83d\ude02', re.UNICODE)
<_sre.SRE_Match object at 0x7fdf7405d6c0>

Notice how \ud83d\ude02 and \U0001f602 produce the same bytes. Your Mac OS build treats the character \U0001f602 as the surrogate pair \ud83d\ude02 (two 4-hex-digit \u escapes), which matches your regex. Linux treats it as a single code point (one 8-hex-digit \U escape), which doesn't fall into any of the ranges in your regex.
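A one-line check (again an illustrative session, not part of the original answer) makes the difference concrete:

>>> u'\ud83d\ude02' == u'\U0001f602'
True    # on a narrow build; a wide (UCS-4) build returns False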

Your options are: 1) add the following range of characters to your regex under Linux:

ur'[\U0001F600-\U0001F64F]'

But it will break the regex under Mac OS, because a narrow build splits the \U escapes into surrogate pairs and the character range no longer compiles.

2) switch to Python 3.

3) rebuild your Python on the Mac with the --enable-unicode=ucs4 option.
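If you need one script to work on both machines, a minimal sketch is to pick the pattern based on sys.maxunicode; the ranges below are illustrative and only loosely mirror the regex from the question:

import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide (UCS-4) build, e.g. the Debian install: astral emoji are single code points.
    emoji = re.compile(u'[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u27BF]',
                       re.UNICODE)
else:
    # Narrow build, e.g. the Mac OS install: astral emoji arrive as surrogate pairs.
    emoji = re.compile(u'(\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff]'
                       u'|[\u2600-\u27BF])', re.UNICODE)

print(bool(emoji.match(u'\U0001f602')))  # True on either build

On the narrow build the test string is itself stored as the surrogate pair, so the second pattern matches it directly.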



Source: https://stackoverflow.com/questions/34679514/emoji-not-detected-with-python-regular-expression-in-linux
