Replacing Emoji Unicode Range from Arabic Tweets using Java

萌比男神i 2021-01-01 05:10

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = \"اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز          


        
2 Answers
  •  天涯浪人
    2021-01-01 05:59

    Java 5 and 6

    If your program has to run on a Java 5 or 6 JVM and you want to match characters in the range U+1F601 to U+1F64F, use surrogate pairs in the character class:

    Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");
    

    This method is also valid in Java 7 and above: in Sun/Oracle's implementation (as decompiling the Pattern.compile() method shows), the String containing the pattern is converted into an array of code points before compilation, so each surrogate pair is treated as a single code point.
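
As a minimal sketch of how the surrogate-pair character class might be applied to the original task, the class below strips the range U+1F601 to U+1F64F from a string. The class name EmoticonStripper, the method strip, and the sample text are hypothetical, and the non-ASCII characters are written as escapes only to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

final class EmoticonStripper {
    // The surrogate pairs are Java source escapes, so the compiler turns them
    // into the actual characters U+1F601 and U+1F64F before the regex is parsed.
    private static final Pattern EMOTICONS =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    /** Removes every character in U+1F601..U+1F64F from the given text. */
    static String strip(String text) {
        return EMOTICONS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical sample: the Arabic word "marhaba", a space, then U+1F602.
        String tweet = "\u0645\u0631\u062D\u0628\u0627 \uD83D\uDE02";
        System.out.println(strip(tweet));
    }
}
```

Passing a replacement other than the empty string to replaceAll would substitute the emoticons instead of deleting them. Note that U+1F600 lies just outside this range and is left untouched.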

    Java 7 and above

    1. You can use the \x{...} construct from David Wallace's answer, which is available from Java 7.

    2. Alternatively, you can match the whole Emoticons Unicode block, which spans code points U+1F600 (rather than U+1F601) to U+1F64F.

      Pattern emoticons = Pattern.compile("\\p{InEmoticons}");
      

      Support for the Emoticons block was added in Java 7, so this method also requires Java 7 or later.

    3. Although the other methods are preferred, you can also specify supplementary characters by writing the surrogate escapes in the regex itself. There is no reason to do this in source code, but this change in Java 7 corrects the behavior in applications where the regex is entered for searching and pasting the character directly is not possible.

      Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
      

      /!\ Warning

      Never mix the two syntaxes when you specify a supplementary code point, as in:

      • "[\\uD83D\uDE01-\\uD83D\\uDE4F]"

      • "[\uD83D\\uDE01-\\uD83D\\uDE4F]"

      In Oracle's implementation, these patterns match the lone code point U+D83D plus the range from code point U+DE01 to code point U+1F64F.
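
The Java 7+ spellings and the mixed-syntax pitfall can be compared side by side. The following is only a sketch, assuming a Java 7 or later runtime with Oracle/OpenJDK behavior. The class EmoticonSpellings, the helper matches, and the probe code points are hypothetical, chosen for illustration.

```java
import java.util.regex.Pattern;

final class EmoticonSpellings {
    // Code point escapes, available from Java 7.
    static final Pattern HEX = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");
    // The whole Emoticons block, U+1F600..U+1F64F.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");
    // Regex-level surrogate escapes, collapsed into code points from Java 7.
    static final Pattern ESCAPED = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
    // Mixed syntax from the warning: do NOT use, shown only to expose the bug.
    static final Pattern MIXED = Pattern.compile("[\\uD83D\uDE01-\\uD83D\\uDE4F]");

    /** Returns true if the pattern matches the single given code point. */
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // Probes: an emoticon, the first code point of the block,
        // an Arabic presentation form, and a plain Arabic letter.
        int[] probes = {0x1F602, 0x1F600, 0xFEFB, 0x0645};
        for (int cp : probes) {
            System.out.printf("U+%04X hex=%b block=%b escaped=%b mixed=%b%n",
                    cp, matches(HEX, cp), matches(BLOCK, cp),
                    matches(ESCAPED, cp), matches(MIXED, cp));
        }
    }
}
```

Because the mixed pattern covers U+DE01 to U+1F64F, it would also catch Arabic presentation forms such as U+FEFB, which is especially harmful when the goal is to strip only emoticons from Arabic text.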

    Note

    In Oracle's implementation of Java 5 and 6, Pattern.u() doesn't collapse valid regex-escaped surrogate pairs such as "\\uD83D\\uDE01". As a result, the pattern is interpreted as two lone surrogates, which will fail to match anything.
