How do I remove emoji characters from a string?

家住魔仙堡 提交于 2019-11-27 04:38:55

问题


I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to just remove all such emojis from the text with rejex.. except, I'm not sure how to convert that emoji into it's unicode sequence.. How do I do that?

edit:

I'm trying to save the user input into mysql. It looks like mysql UTF8 doesn't really support unicode characters and the right way to do it would be by changing the schema but I don't think that is an option for me. So I'm trying to just remove all the emoji characters before saving it in the database.

This is my schema for the relevant column:

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoding bytes and it doesn't really work..


回答1:


Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result); // 2
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.



来源:https://stackoverflow.com/questions/28023682/how-do-i-remove-emoji-characters-from-a-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!