Remove ✅,

离开以前 2020-11-28 20:03

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for

7 answers
  • 2020-11-28 20:40

    "I gave some examples below, and thought that Latin is enough, but..."

    "Is there a way to remove all these signs from the input string and keeping only the letters & punctuation in the different languages?"

    After the question was edited, I developed a new solution using the Character.getType method, which appears to be the best shot at this.

    package zmarcos.emoji;
    
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    
    public class TestEmoji {
    
        public static void main(String[] args) {
            String[] arr = {"Remove ✅,                                                                     
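
    The listing above is cut off in this copy, so what follows is only a minimal sketch of the Character.getType idea, not the answer's original code. The class name EmojiTypeFilter, the method names, and the choice of which general categories to drop are all assumptions made for illustration.

    ```java
    final class EmojiTypeFilter {

        // Assumption: these general categories are the "signs" we want to drop.
        static boolean isDropped(int codePoint) {
            switch (Character.getType(codePoint)) {
                case Character.OTHER_SYMBOL:     // So: most emoji and pictographs, e.g. U+2705
                case Character.MODIFIER_SYMBOL:  // Sk: includes emoji skin-tone modifiers
                case Character.SURROGATE:        // Cs: unpaired surrogates
                case Character.PRIVATE_USE:      // Co
                case Character.UNASSIGNED:       // Cn: not known to this Java runtime
                    return true;
                default:
                    return false;
            }
        }

        static String stripSymbols(String input) {
            StringBuilder out = new StringBuilder(input.length());
            // Walk the string by code point so surrogate pairs are never split.
            for (int i = 0; i < input.length(); ) {
                int cp = input.codePointAt(i);
                if (!isDropped(cp)) {
                    out.appendCodePoint(cp);
                }
                i += Character.charCount(cp);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // Emoji are written as escapes (U+2705, U+1F600) to keep the source ASCII-safe.
            System.out.println(stripSymbols("Remove \u2705, \u043f\u0440\u0438\u0432\u0435\u0442 \uD83D\uDE00!"));
        }
    }
    ```

    Letters, digits and punctuation from any script pass through untouched. Note that this sketch leaves behind invisible companions of emoji such as the variation selector U+FE0F (a non-spacing mark) and the zero-width joiner U+200D (a format character), so a stricter filter may want to handle those separately.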
  • 2020-11-28 20:41

    Use a jQuery plugin called RM-Emoji. Here's how it works:

    $('#text').remove('emoji').fast()
    

    This is the fast mode, which may miss some emojis because it uses heuristics to find them in the text. Use the .full() method to scan the entire string, which is guaranteed to remove all emojis.

  • 2020-11-28 20:43

    Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

    String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
    String emotionless = aString.replaceAll(characterFilter,"");
    

    So:

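    • [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class covering letters (\\p{L}), marks (\\p{M}), numbers (\\p{N}), punctuation (\\p{P}), separators (\\p{Z}), format characters (\\p{Cf}), surrogates (\\p{Cs}), and whitespace such as newlines and tabs (\\s). \\p{L} includes letters from every script, such as Latin, Cyrillic, and Kanji. Note that \\p{Cs} is the surrogate category, not "everything above U+FFFF": Java's regex engine matches by code point, so a well-formed surrogate pair is classified by the character it encodes, and \\p{Cs} only matches unpaired surrogates.
    • The ^ at the start of the character class negates it, so replaceAll removes every character that is not on the whitelist.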

    Example:

    String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。                                                                    
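
    The example above is cut off in this copy. As a rough, self-contained sketch of how the whitelist filter behaves (the class name WhitelistFilter and the sample strings are made up for illustration; the pattern itself is the one from this answer):

    ```java
    final class WhitelistFilter {

        // Same pattern as above: anything outside the whitelisted categories is removed.
        static final String CHARACTER_FILTER =
                "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";

        static String strip(String input) {
            return input.replaceAll(CHARACTER_FILTER, "");
        }

        public static void main(String[] args) {
            // U+2705 and U+1F600 are written as escapes so the source stays ASCII-safe.
            System.out.println(strip("hello \u2705 world"));
            System.out.println(strip("\u3053\u3093\u306b\u3061\u306f\uD83D\uDE00!"));
            // Math and currency symbols are not on the whitelist, so they go too.
            System.out.println(strip("a+b=$5"));
        }
    }
    ```

    One consequence of the whitelist worth knowing about: symbols such as +, = and $ (categories Sm and Sc) are removed along with the emojis. If those should survive, adding \\p{Sm} and \\p{Sc} to the class is one option, though it is worth checking that this does not let unwanted symbols back in.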
  • 2020-11-28 20:46

    Based on the Full Emoji List, v11.0, there are 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.

    Given the full list of emojis, you need to filter them out by code point. Iterating over individual chars or bytes won't work, because a single code point can span more than one of them. Java strings are UTF-16, so most emojis take two chars (a surrogate pair), although some, such as ✅ (U+2705), fit in one.

    String input = "ab✅cd";
    for (int i = 0; i < input.length();) {
      int cp = input.codePointAt(i);
      // filter out if matches
      i += Character.charCount(cp); 
    }
    

    Mapping from Unicode code point U+2705 to Java int is straightforward:

    int viSign = 0x2705;
    

    or since Java supports Unicode Strings:

    int viSign = "✅".codePointAt(0);
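
    Putting the pieces together, a minimal sketch of the code-point filter might look like this. The class name CodePointFilter and the two-entry set are placeholders; a real list would hold every code point from whichever emoji list you target.

    ```java
    import java.util.HashSet;
    import java.util.Set;

    final class CodePointFilter {

        // Placeholder list: only two of the code points that would need removing.
        static final Set<Integer> EMOJI = new HashSet<>();
        static {
            EMOJI.add(0x2705);   // U+2705, one UTF-16 char
            EMOJI.add(0x1F600);  // U+1F600, two UTF-16 chars (a surrogate pair)
        }

        static String strip(String input) {
            StringBuilder out = new StringBuilder(input.length());
            for (int i = 0; i < input.length(); ) {
                int cp = input.codePointAt(i);
                if (!EMOJI.contains(cp)) {
                    out.appendCodePoint(cp);  // keep everything not on the list
                }
                i += Character.charCount(cp);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(strip("ab\u2705cd"));
        }
    }
    ```

    This only removes single code points. Emojis built from several code points (flags, skin-tone variants, sequences joined with U+200D) would need every component on the list, or some extra handling.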
    
  • 2020-11-28 20:47

    ICU4J is your friend.

    // ch is an int code point, e.g. obtained from String.codePointAt(i)
    UCharacter.hasBinaryProperty(ch, UProperty.EMOJI);
    

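    Remember to keep your version of icu4j up to date, and note that this only filters out official Unicode emoji, not other symbol characters. Combine it with filtering on other character types as desired. Also be aware that the Emoji property is set for the ASCII digits, # and *, so exclude those explicitly if you want to keep them.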

    More information: http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#EMOJI

  • 2020-11-28 20:57

    Try this project: simple-emoji-4j

    Compatible with Emoji 12.0 (2018.10.15)

    Usage is simple:

    EmojiUtils.removeEmoji(str)
    