Check if letter is emoji

孤街浪徒 提交于 2019-11-28 10:26:19

It seems like those emojis are two characters long, but with split("") you are splitting between each single character, thus none of those letters can be the emoji you are looking for.

Instead, you could try splitting between words:

for (String word : sentence.split(" ")) {
    if (word.matches(emo_regex)) {
        System.out.println(word);
    }
}

But of course this will miss emojis that are joined to a word, or punctuation.

Alternatively, you could just use a Matcher to find any group in the sentence that matches the regex.

Matcher matcher = Pattern.compile(emo_regex).matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group());
}

You could use emoji4j library. The following should solve the issue.

String htmlifiedText = EmojiUtils.htmlify(text);
// regex to identify html entitities in htmlified text
Matcher matcher = htmlEntityPattern.matcher(htmlifiedText);

while (matcher.find()) {
    String emojiCode = matcher.group();
    if (isEmoji(emojiCode)) {

        emojis.add(EmojiUtils.getEmoji(emojiCode).getEmoji());
    }
}

You can use Character class for determining is letter is part of surrogate pair. There some helpful methods to deal with surrogate pairs emoji symbols, for example:

String text = "💩";
if (text.length() > 1 && Character.isSurrogatePair(text.charAt(0), text.charAt(1))) {
    int codePoint = Character.toCodePoint(text.charAt(0), text.charAt(1));
    char[] c = Character.toChars(codePoint);
}

This function I created checks if given String consists of only emojis. in other words if the String contains any character not included in the Regex, it will return false.

private static boolean isEmoji(String message){
    return message.matches("(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|" +
            "[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|" +
            "[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|" +
            "[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|" +
            "[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|" +
            "[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|" +
            "[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|" +
            "[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|" +
            "[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|" +
            "[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|" +
            "[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)+");
}

Example of implementation:

public static int detectEmojis(String message){
    int len = message.length(), NumEmoji = 0;
    // if the the given String is only emojis.
    if(isEmoji(message)){
        for (int i = 0; i < len; i++) {
            // if the charAt(i) is an emoji by it self -> ++NumEmoji
            if (isEmoji(message.charAt(i)+"")) {
                NumEmoji++;
            } else {
                // maybe the emoji is of size 2 - so lets check.
                if (i < (len - 1)) { // some Emojis are two characters long in java, e.g. a rocket emoji is "\uD83D\uDE80";
                    if (Character.isSurrogatePair(message.charAt(i), message.charAt(i + 1))) {
                        i += 1; //also skip the second character of the emoji
                        NumEmoji++;
                    }
                }
            }
        }
        return NumEmoji;
    }
    return 0;
}

given is a function that runs on a string (of only emojis) and return the number of emojis in it. (with the help of other answers i found here on StackOverFlow).

Try this project simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Simple with:

EmojiUtils.containsEmoji(str)
slim

It's worth bearing in mind that Java code can be written in Unicode. So you can just do:

@Test
public void containsEmoji_detects_smileys() {
    assertTrue(containsEmoji("This 😂 is a smiley "));
    assertTrue(containsEmoji("This 😄 is a different smiley"));
    assertFalse(containsEmoji("No smiley here"));
}

private boolean containsEmoji(String s) {
    String pattern = ".*[😂😄].*";
    return s.matches(pattern);
}

Although see: Should source code be saved in UTF-8 format for discussion on whether that's a good idea.


You can split a String into Unicode codepoints in Java 8 using String.codePoints(), which returns an IntStream. That means you can do something like:

Set<Integer> emojis = new HashSet<>();
emojis.add("😂".codePointAt(0));
emojis.add("😄".codePointAt(0));
String s = "1😂34😄5";
s.codePoints().forEach( codepoint -> {
    System.out.println(
        new String(Character.toChars(codepoint)) 
        + " " 
        + emojis.contains(codepoint));
});

... prints ...

1 false
😂 true
3 false
4 false
😄 true
5 false

Of course if you prefer not to have literal unicode chars in your code you can just put numbers in your set:

emojis.add(0x1F601);
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!