I am trying to replace emoji in Arabic tweets using Java.
I used this code:
String line = \"اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز
From the Javadoc for the Pattern class:
A Unicode character can also be represented in a regular-expression by using its Hex notation(hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
This means that the regular expression that you're looking for is ([\x{1F601}-\x{1F64F}]). Of course, when you write this as a Java String literal, you must escape the backslashes.
Pattern unicodeOutliers = Pattern.compile("([\\x{1F601}-\\x{1F64F}])");
Note that the \x{...} construct is only available from Java 7 onward.
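To show how the pattern might be used to actually strip the emoji, here is a minimal sketch. The class name EmojiStripper, the helper stripEmoticons, and the sample text are assumptions made for illustration, not part of the original question. The non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class EmojiStripper {
    // Same range as above: U+1F601 to U+1F64F. The \x{...} syntax needs Java 7 or later.
    private static final Pattern EMOTICONS =
            Pattern.compile("[\\x{1F601}-\\x{1F64F}]");

    // Returns the text with every code point in that range removed.
    static String stripEmoticons(String text) {
        return EMOTICONS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // An Arabic word, then U+1F602 written as its surrogate pair
        // (\uD83D\uDE02), then some Latin text.
        String tweet = "\u0645\u0631\u062D\u0628\u0627 \uD83D\uDE02 hello";
        System.out.println(stripEmoticons(tweet));
    }
}
```

This only covers the range given above. Emoji outside U+1F601 to U+1F64F would need additional ranges added to the character class.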