Removing emojis from a string in Python

Question

I found this code for removing emojis in Python, but it is not working. Can you help with other code or a fix for this?

I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here's the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list is a word, for example: ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

UPDATE: I used this other code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emojis; they are still shown. Any clue why that is?

Answer 1:

This works for me. It is motivated by https://stackoverflow.com/a/43813727/6579239

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
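One caveat worth seeing in action: encoding to ASCII with errors ignored drops every non-ASCII character, not only emoji. A minimal Python 3 sketch of that behaviour (the name strip_non_ascii and the sample strings are made up for illustration):

```python
def strip_non_ascii(text):
    # Encode to ASCII, silently dropping anything ASCII cannot represent,
    # then decode the remaining bytes back to str.
    return text.encode("ascii", "ignore").decode("ascii")


print(repr(strip_non_ascii("This dog \U0001F602")))  # the emoji is dropped
print(repr(strip_non_ascii("caf\u00e9 na\u00efve")))  # accented letters are dropped too
```

So this is probably only a good fit when the text is expected to be plain English.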

Answer 2:

On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog

Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
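For readers on Python 3, here is a rough sketch of the same idea applied to the word list from the question. It assumes the words arrive as UTF-8 encoded bytes (that is what '\xf0\x9f\x98\x82' looks like); the names raw_words and cleaned are illustrative only, and the pattern still covers just the four ranges above.

```python
import re

# Same four ranges as above; Python 3 str literals are already Unicode.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# Assumption: the input is UTF-8 encoded bytes, as in the question.
raw_words = [b"This", b"dog", b"\xf0\x9f\x98\x82", b"https://t.co/5N86jYipOI"]

# Decode first, so the regex sees code points instead of raw bytes.
words = [w.decode("utf-8") for w in raw_words]

# Strip emoji, then drop words that became empty.
cleaned = [EMOJI_PATTERN.sub("", w) for w in words]
cleaned = [w for w in cleaned if w]

print(cleaned)
```

The decoding step is the important part: the four bytes \xf0\x9f\x98\x82 only become the single code point U+1F602 once they are decoded.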

Answer 3:

If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:

emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)
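To see where values like \ud83d and \ude00 come from: on a narrow build, each code point above U+FFFF is stored as a UTF-16 surrogate pair. A small Python 3 sketch of that arithmetic (to_surrogate_pair is a made-up helper name, not part of any library):

```python
def to_surrogate_pair(code_point):
    # Only code points above the BMP are represented as surrogate pairs.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F602)  # face with tears of joy
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F602".encode("utf-16-be").hex())
```

This is why the emoticons range U+1F600 to U+1F64F turns into \ud83d followed by something in [\ude00-\ude4f].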

Answer 4:

Complete version of removing emojis:

import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
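A word of caution about the \U000024C2-\U0001F251 range: it spans far more than emoji, including CJK ideographs, kana and Hangul. A quick Python 3 check of just that one range (the sample characters are arbitrary choices for illustration):

```python
import re

# Only the widest range from the pattern above, isolated for testing.
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]")

samples = {
    "CJK ideograph": "\u4e2d",
    "hiragana": "\u3042",
    "Hangul syllable": "\ud55c",
    "ASCII letter": "a",
}

for label, ch in samples.items():
    # True means remove_emoji would delete this character.
    print(label, bool(BROAD_RANGE.search(ch)))
```

If the text may contain Chinese, Japanese or Korean, it is probably safer to narrow or drop that range.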

Answer 5:

The accepted answer and others worked for me for a while, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane. This also excludes future additions to the other Unicode planes (where emoji and such live), which means I don't have to update my code every time new Unicode characters are added :).

In Python 2.7, convert your text to unicode if it is not already, and then use the negated regex below (it substitutes anything not in the character class, which covers all BMP characters except surrogates, which are used in pairs to encode supplementary-plane characters in UTF-16).

NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
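On Python 3, where str is always Unicode and there are no narrow builds, the same idea might look like the sketch below. The function name strip_non_bmp is illustrative. Note that emoji-like symbols inside the BMP, such as U+2600, are left alone by this approach.

```python
import re

# Match anything above U+FFFF, i.e. outside the Basic Multilingual Plane.
NON_BMP_RE = re.compile(r"[^\u0000-\uFFFF]")


def strip_non_bmp(text):
    # text is expected to be a str; decode bytes before calling this.
    return NON_BMP_RE.sub("", text)


# U+2600 (BMP) survives, U+1F602 (supplementary plane) is removed.
print(repr(strip_non_bmp("sun \u2600 joy \U0001F602!")))
```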

Answer 6:

If you are not keen on using regex, the best solution could be using the emoji Python package.

Here is a simple function to return emoji free text (thanks to this SO answer):

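import emoji

def give_emoji_free_text(text):
    # text is a UTF-8 encoded byte string (Python 2 str)
    decoded = text.decode('utf-8')
    # every character of the input that the emoji package lists as an emoji
    emoji_list = [c for c in decoded if c in emoji.UNICODE_EMOJI]
    # keep only the whitespace-separated words that contain none of them
    clean_text = ' '.join([word for word in decoded.split() if not any(e in word for e in emoji_list)])
    return clean_text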

If you are dealing with strings containing emojis, this is straightforward:

>> s1 = "Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙"
>> print s1
Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend

If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.

>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog

Edits

Based on the comment, it should be as easy as:

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))

Answer 7:

I tried to collect the complete list of Unicode ranges. I use it to extract emojis from tweets and it works very well for me.

# Emojis pattern
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)

Answer 8:

Because [...] means any one of a set of characters, and because two characters in a set separated by a dash mean a range of characters (often "a-z" or "0-9"), your pattern says "a slash, followed by any one of the characters in the set containing x, {, 1, F, 6, 0, 1, the range } through x, {, 1, F, 6, 4, F or }, followed by a slash and the letter u". That range in the middle, } through x, is what re is calling the bad character range: } (0x7D) sorts after x (0x78), so the range is reversed.
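The error is easy to reproduce. Here is a short Python 3 sketch that compiles the pattern from the question, prints the resulting error, and then tries a Python-style rewrite of the intended range (the variable names are illustrative):

```python
import re

# The pattern from the question uses PCRE/PHP-style syntax, which Python's
# re module reads as an ordinary character set.
broken = r'/[x{1F601}-x{1F64F}]/u'

try:
    re.compile(broken)
except re.error as exc:
    message = str(exc)
    print(message)
else:
    message = None

# What was presumably intended, written with Python escapes.
fixed = re.compile("[\U0001F601-\U0001F64F]")
print(repr(fixed.sub("", "This dog \U0001F602")))
```

In Python there are no /.../ delimiters and no trailing u modifier; code points are written as \UXXXXXXXX escapes instead.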

Answer 9:

This is my solution. It also removes the additional man and woman emoji which can't be rendered by Python: 🤷‍♂ and 🤦‍♀

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       u"\U0001f926-\U0001f937"
                       u"\u200d"
                       u"\u2640-\u2642" 
                       "]+", flags=re.UNICODE)
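The reason \u200d and \u2640-\u2642 are needed: these emoji are sequences of several code points joined by a zero width joiner, not single characters. A small Python 3 sketch, assuming the shrugging-man sequence is U+1F937, U+200D, U+2642 (variable names are illustrative):

```python
import re

# Person shrugging + zero width joiner + male sign.
shrug_man = "\U0001F937\u200D\u2642"

# Show the individual code points that make up the one visible emoji.
print([f"U+{ord(ch):04X}" for ch in shrug_man])

base_only = re.compile("[\U0001F926-\U0001F937]+")
with_joiner = re.compile("[\U0001F926-\U0001F937\u200D\u2640-\u2642]+")

# Without the joiner and gender sign, invisible/odd leftovers remain.
leftover = base_only.sub("", shrug_man)
cleaned = with_joiner.sub("", shrug_man)

print(repr(leftover), repr(cleaned))
```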

Answer 10:

Converting the string into another character set like this might help:

text.encode('latin-1', 'ignore').decode('latin-1')
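As a rough illustration of what this keeps and drops in Python 3: Latin-1 covers U+0000 to U+00FF, so Western European accented letters survive, while emoji and any other character outside that range are discarded. The helper name strip_non_latin1 and the samples are made up for this sketch.

```python
def strip_non_latin1(text):
    # Characters that Latin-1 cannot encode are silently dropped.
    return text.encode("latin-1", "ignore").decode("latin-1")


print(repr(strip_non_latin1("caf\u00e9 \U0001F602")))   # accent kept, emoji dropped
print(repr(strip_non_latin1("\u0141\u00f3d\u017a")))     # Polish letters partly dropped
```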


Answer 11:

Here's a Python 3 script that uses the emoji library's get_emoji_regexp() - as suggested by kingmakerking and Martijn Pieters in their answer/comment.

It reads text from a file and writes the emoji-free text to another file.

import emoji
import re


def strip_emoji(text):

    print(emoji.emoji_count(text))

    new_text = re.sub(emoji.get_emoji_regexp(), r"", text)

    return new_text


with open("my_file.md", "r") as file:
    old_text = file.read()

no_emoji_text = strip_emoji(old_text)

with open("file.md", "w+") as new_file:
    new_file.write(no_emoji_text)

Answer 12:

Tried all the answers; unfortunately, they didn't remove the new hugging face emoji 🤗, the clinking glasses emoji 🥂, 🤔, 🤘 and a lot more.

Ended up with a list of all possible emoji, taken from the Python emoji package on GitHub, and I had to create a gist because there's a 30k character limit on Stack Overflow answers and it's over 70k characters.
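As a possible explanation, those four emoji all sit in the Supplemental Symbols and Pictographs block (U+1F900 to U+1F9FF), which the common four-range pattern does not cover. A Python 3 sketch comparing the two (this is not the gist mentioned above, and adding one block does not make the pattern complete):

```python
import re

FOUR_RANGES = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
)

basic = re.compile("[" + FOUR_RANGES + "]+")
# Same ranges plus the Supplemental Symbols and Pictographs block.
extended = re.compile("[" + FOUR_RANGES + "\U0001F900-\U0001F9FF" + "]+")

# Hugging face, clinking glasses, thinking face, sign of the horns.
newer = "\U0001F917\U0001F942\U0001F914\U0001F918"

print(basic.sub("", newer) == newer)   # the basic pattern leaves them untouched
print(extended.sub("", newer) == "")   # the extended pattern removes them
```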

Answer 13:

Complete version of removing emojis
✍ 🌷 📌 👈🏻 🖥

import re

def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

Source: https://stackoverflow.com/questions/48860804/python-how-to-remove-all-emojis

Tags

python

string

unicode

special-characters

emoji