Replace emoji Unicode symbols using a regexp in JavaScript

♀尐吖头ヾ submitted on 2019-11-28 23:45:12
Jukka K. Korpela

The \u.... notation has four hex digits, no less, no more, so it can only represent code points up to U+FFFF. Unicode characters above that are represented as pairs of surrogate code points.

So some indirect approach is needed. See the question "JavaScript strings outside of the BMP".

For example, you could look for code points in the range [\uD800-\uDBFF] (high surrogates), and when you find one, check that the next code point in the string is in the range [\uDC00-\uDFFF] (if not, there is a serious data error), interpret the two as a Unicode character, and replace them by whatever you wish to put there. This looks like a job for a simple loop through the string, rather than a regular expression.
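
The loop described above might look like the following minimal sketch. The function name replaceAstral and its replacer callback are hypothetical names chosen for illustration, and copying lone surrogates through unchanged is only one possible way to treat the data error.

```javascript
// Hypothetical helper: walk the string one UTF-16 code unit at a time and
// replace every well-formed surrogate pair with the result of `replacer`.
function replaceAstral(str, replacer) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the pair into a single code point.
        const codePoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
        out += replacer(codePoint);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP character or lone surrogate (a data error): copied unchanged here.
    out += str[i];
  }
  return out;
}

const cleaned = replaceAstral('a\uD83D\uDE00b', () => '');
console.log(cleaned); // "ab"
```

Passing a callback instead of a fixed string keeps the decision about what to put in place of each character with the caller.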

Maybe take a look at this article: http://crocodillon.com/blog/parsing-emoji-unicode-in-javascript

The emoticon emojis range from U+1F600 to U+1F64F.

Translated to JavaScript's UTF-16, that is \ud83d\ude00 to \ud83d\ude4f.

The first code unit is always \ud83d.

So the regex is:

/\ud83d[\ude00-\ude4f]/g
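
As a quick check of that translation, the surrogate pair for a code point can be computed by hand. The helper name toSurrogatePair is made up for this sketch; String.fromCodePoint is the standard ES2015 function.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high.toString(16), low.toString(16)];
}

console.log(toSurrogatePair(0x1F600)); // [ 'd83d', 'de00' ]
console.log(toSurrogatePair(0x1F64F)); // [ 'd83d', 'de4f' ]

// Both ends share the high surrogate d83d, so the regex above applies.
const emoticons = /\ud83d[\ude00-\ude4f]/g;
const sample = 'ok ' + String.fromCodePoint(0x1F601) + '!';
console.log(sample.replace(emoticons, '')); // "ok !"
```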

Hope this helps.

This is somewhat old, but I was looking into this problem and it seems Bradley Momberger has posted a nice solution to it here: http://airhadoken.github.io/2015/04/22/javascript-string-handling-emoji.html

The regex he proposes is:

/[\uD800-\uDFFF]./ // This matches emoji

This regex matches the head surrogate, which is used by emojis, and the character following the head surrogate (which is assumed to be the tail surrogate). Thus, all emojis should be matched correctly, and with

.replace(/[\uD800-\uDFFF]./g,'')

you should be able to remove all emojis.
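
A small experiment, with sample strings made up for this sketch, shows what the pattern does and where its assumption about the following character can bite:

```javascript
const surrogateAndNext = /[\uD800-\uDFFF]./g;

// A well-formed pair (U+1F600) is removed as a unit.
const wellFormed = 'a\uD83D\uDE00b'.replace(surrogateAndNext, '');
console.log(wellFormed); // "ab"

// A lone surrogate takes the following ordinary character with it,
// because "." accepts any code unit except a line terminator.
const malformed = 'a\uD83Dbc'.replace(surrogateAndNext, '');
console.log(malformed); // "ac"
```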

Edit: Better regex found. The above regex misses some emojis.

There is a reddit post with a version for which I cannot find an emoji that escapes the rule. The post is here: https://www.reddit.com/r/tasker/comments/4vhf2f/how_to_regex_emojis_in_tasker_for_search_match_or/ and the regex is:

/[\uD83C-\uDBFF\uDC00-\uDFFF]+/

To match all occurrences, use the g modifier:

/[\uD83C-\uDBFF\uDC00-\uDFFF]+/g

Second Edit: As CodeToad pointed out correctly, ✨ is not recognized by the above Regex, because it's in the dingbats block (thanks to air_hadoken).
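
The gap is easy to reproduce. In this sketch the variable names are illustrative, and the two characters are written as escapes so the code units are visible:

```javascript
const surrogateRun = /[\uD83C-\uDBFF\uDC00-\uDFFF]+/g;

const sparkles = '\u2728';     // U+2728 SPARKLES, a single code unit in the dingbats block
const party = '\uD83C\uDF89';  // U+1F389 PARTY POPPER, a surrogate pair

// Only the surrogate pair is stripped; the dingbat falls outside both ranges.
const output = (sparkles + ' and ' + party).replace(surrogateRun, '');
console.log(output === sparkles + ' and '); // true
```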

The lodash library came up with an excellent Emoji Regex block:

(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?)*

Kevin Scott nicely summarized what this regex covers in his blog post. Spoiler: it includes dingbats 🎉
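
For illustration, the pattern can be dropped into a regex literal with the g flag. This is only a usage sketch, not lodash's API: the variable name emojiPattern is hypothetical and the sample strings are made up.

```javascript
// The pattern quoted above, as a global regex literal (variable name is illustrative).
const emojiPattern = /(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?(?:\u200d(?:[^\ud800-\udfff]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff])[\ufe0e\ufe0f]?(?:[\u0300-\u036f\ufe20-\ufe23\u20d0-\u20f0]|\ud83c[\udffb-\udfff])?)*/g;

const sparkles = '\u2728';     // dingbat, U+2728
const party = '\ud83c\udf89';  // U+1F389
// Man, woman and girl joined by zero-width joiners (U+200D).
const family = '\ud83d\udc68\u200d\ud83d\udc69\u200d\ud83d\udc67';

const stripped = ('x' + sparkles + 'y' + party + 'z').replace(emojiPattern, '');
console.log(stripped); // "xyz"

// The whole ZWJ sequence is consumed as a single match.
console.log(family.match(emojiPattern).length); // 1
```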

  1. /\ud83d[\ude00-\ude4f]/g

does not cover all emojis, for example: 👿 👹 👺 💀 👻 👽 🤖 💩. See http://getemoji.com/ and try your regex at https://regex101.com/

  2. /[\uD83C-\uDBFF\uDC00-\uDFFF]+/g

does not cover all emojis either, for example: ⛑ ☕️ ☁️☄️ ☀️☃️ ⛄️ ❄️ ☹️☺️⛩⛱™️ ©️ ®️ 〰️ ➰ ➿

  3. Even this regex does not allow you to remove all emojis, for example 🖥 🖨 🖱 🖲 🕹 🗜:

https://github.com/nizaroni/emoji-strip/blob/master/dist/emoji-strip.js#L79

Then, can you say why you think this regex is bad for removing all exotic characters and emojis?

/[\u1000-\uFFFF]+/g
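
One concrete drawback shows up in a short test: the range covers far more than emoji. The sample strings below are assumptions chosen for this sketch.

```javascript
const broad = /[\u1000-\uFFFF]+/g;

// Emoji above U+FFFF are removed, since both surrogate halves fall in the range...
const emojiGone = 'hi \uD83D\uDE00'.replace(broad, '');
console.log(emojiGone); // "hi "

// ...but so is ordinary text, e.g. CJK characters and the euro sign (U+20AC).
const textGone = '\u4F60\u597D, 5\u20AC'.replace(broad, '');
console.log(textGone); // ", 5"

// Meanwhile symbols below U+1000, such as the copyright sign (U+00A9), stay.
const kept = '\u00A9 2019'.replace(broad, '');
console.log(kept); // "© 2019"
```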

The regex pattern below worked for me in Java.

"[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]"

As Java's String uses UTF-16 encoding and many emojis lie above U+FFFF, this regex pattern relies on surrogate pairs to identify emojis.
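
For comparison, a rough JavaScript counterpart can be written with the ES2015 u flag, which makes the regex operate on code points instead of UTF-16 code units. The code point ranges are derived from the surrogate pairs in the Java pattern (\ud83c\udc00 is U+1F000 and \ud83d\udfff is U+1F7FF); the variable name is hypothetical.

```javascript
// With the "u" flag, \u{...} escapes denote whole code points, so ranges
// above U+FFFF can be written directly inside a character class.
const pictographs = /[\u{1F000}-\u{1F3FF}\u{1F400}-\u{1F7FF}\u{2600}-\u{27FF}]/gu;

// U+1F600 (grinning face) and U+2615 (hot beverage) are both in range.
const result = 'a\u{1F600}b\u2615c'.replace(pictographs, '');
console.log(result); // "abc"
```

Without the u flag, a class such as [\ud83c\udc00-\ud83c\udfff] would be read in JavaScript as a set of individual code units rather than a range of pairs.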

Adrien Parrochia

To remove all possible emojis:

new RegExp('[\u1000-\uFFFF]+', 'g');

Maybe you should use replace this way?

reg = str.replace(new RegExp('😊','g'),'');

Try out https://github.com/iLeonidze/emoji.js

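Emojis are in the range U+1F600 to U+1F64F, among others.

You can use this line in your script before sending the text as JSON. Code points above U+FFFF need the \u{...} form together with the u flag, because a four-digit escape such as \u1F60 denotes a different character:

text.replace(/[\u{1F600}-\u{1F64F}]|[\u{2702}-\u{27B0}]|[\u{1F680}-\u{1F6CF}]|[\u{1F300}-\u{1F70F}]|[\u{2600}-\u{26FF}]/gu, "");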