Question:
I want to check whether a letter is an emoji. I found some similar questions on SO, along with this regex:
private final String emo_regex = "([\\u20a0-\\u32ff\\ud83c\\udc00-\\ud83d\\udeff\\udbb9\\udce5-\\udbb9\\udcee])";
However, when I do the following in a sentence like:
for (int k=0; k
It doesn't add any letters with an emoji. I've also tried with a Matcher and a Pattern, but that didn't work either. Is there something wrong with the regex, or am I missing something obvious in my code?
This is how I get the letter:
sentence = "Jij staat op 10 ?";
String[] letters = sentence.split("");
The last ? should be recognized and added to emoticon.
Answer 1:
It seems those emojis are two characters long, but split("") splits between every single character, so none of the resulting letters can be the emoji you are looking for.
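To make the two-character point concrete: a Java char is a UTF-16 code unit, and any code point above U+FFFF is stored as two of them. The following is a minimal sketch, assuming U+1F601 as a stand-in emoji; the class name SplitDemo and the helper byCodePoint are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitDemo {
    // Splits a string into one substring per Unicode code point,
    // so the two halves of a surrogate pair stay together.
    static List<String> byCodePoint(String s) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            parts.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return parts;
    }

    public static void main(String[] args) {
        String sentence = "op 10 \uD83D\uDE01"; // U+1F601 as a stand-in emoji
        // Number of pieces when splitting between every char
        System.out.println(sentence.split("").length);
        // Number of pieces when splitting between code points
        System.out.println(byCodePoint(sentence).size());
    }
}
```

With a code-point split, the emoji arrives as a single two-char string, which is the unit a regex such as the one in the question can actually match.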
Instead, you could try splitting between words:
for (String word : sentence.split(" ")) {
    if (word.matches(emo_regex)) {
        System.out.println(word);
    }
}
But of course this will miss emojis that are attached to a word or to punctuation.
Alternatively, you could just use a Matcher to find any group in the sentence that matches the regex.
Matcher matcher = Pattern.compile(emo_regex).matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group());
}
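A self-contained version of the Matcher approach might look like the sketch below. It does not use the asker's regex: as a labeled assumption, it swaps in a narrower pattern covering only the Unicode Emoticons block (U+1F600 to U+1F64F), written with the \x{...} code point syntax that Pattern supports since Java 7. The class name EmojiFinder and the method findEmojis are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmojiFinder {
    // Assumption: only the Emoticons block (U+1F600 to U+1F64F) counts as emoji here.
    private static final Pattern EMOTICONS = Pattern.compile("[\\x{1F600}-\\x{1F64F}]");

    // Collects every match in the sentence, wherever it occurs.
    static List<String> findEmojis(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMOTICONS.matcher(sentence);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The emoji (U+1F601) is glued to "10" and followed by punctuation.
        System.out.println(findEmojis("Jij staat op 10\uD83D\uDE01!").size());
    }
}
```

Because find() scans the whole input, this also catches an emoji that is attached to a word or to punctuation, which the word-splitting variant misses.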
Answer 2:
You can use the Character class to determine whether a letter is part of a surrogate pair. It has some helpful methods for dealing with emoji symbols encoded as surrogate pairs, for example:
String text = "?";
if (text.length() > 1 && Character.isSurrogatePair(text.charAt(0), text.charAt(1))) {
    int codePoint = Character.toCodePoint(text.charAt(0), text.charAt(1));
    char[] c = Character.toChars(codePoint);
}
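The snippet above only inspects the first two chars. A rough sketch of the same idea applied to a whole string follows; the class name SurrogateScan and the method supplementaryCodePoints are hypothetical. Note the caveat that a surrogate pair only tells you the code point lies above U+FFFF: not every such character is an emoji, and the regex in the question also covers a range (U+20A0 to U+32FF) that needs no surrogate pair at all.

```java
import java.util.ArrayList;
import java.util.List;

final class SurrogateScan {
    // Returns every code point in the text that is encoded as a surrogate pair.
    static List<Integer> supplementaryCodePoints(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            char high = text.charAt(i);
            char low = text.charAt(i + 1);
            if (Character.isSurrogatePair(high, low)) {
                result.add(Character.toCodePoint(high, low));
                i++; // skip the low surrogate that was just consumed
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // U+1F601 is used as a stand-in emoji (an assumption, not the asker's exact character).
        for (int cp : supplementaryCodePoints("Jij staat op 10 \uD83D\uDE01")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```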
Answer 3:
You could use the emoji4j library. The following should solve the issue.
String htmlifiedText = EmojiUtils.htmlify(text);
// regex to identify html entities in htmlified text
Matcher matcher = htmlEntityPattern.matcher(htmlifiedText);
while (matcher.find()) {
    String emojiCode = matcher.group();
    if (isEmoji(emojiCode)) {
        emojis.add(EmojiUtils.getEmoji(emojiCode).getEmoji());
    }
}
Answer 4:
It's worth bearing in mind that Java code can be written in Unicode. So you can just do:
@Test
public void containsEmoji_detects_smileys() {
    assertTrue(containsEmoji("This ? is a smiley "));
    assertTrue(containsEmoji("This ? is a different smiley"));
    assertFalse(containsEmoji("No smiley here"));
}

private boolean containsEmoji(String s) {
    String pattern = ".*[??].*";
    return s.matches(pattern);
}
Although see "Should source code be saved in UTF-8 format" for discussion of whether that's a good idea.
You can split a String into Unicode code points in Java 8 using String.codePoints(), which returns an IntStream. That means you can do something like:
Set<Integer> emojis = new HashSet<>();
emojis.add("?".codePointAt(0));
emojis.add("?".codePointAt(0));
String s = "1?34?5";
s.codePoints().forEach(codepoint -> {
    System.out.println(new String(Character.toChars(codepoint)) + " " + emojis.contains(codepoint));
});
... prints ...
1 false
? true
3 false
4 false
? true
5 false
Of course, if you prefer not to have literal Unicode characters in your code, you can just put numbers in your set:
emojis.add(0x1F601);
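Taking the numeric idea one step further, a range check avoids listing code points one by one. This is a minimal sketch, assuming that only the Unicode Emoticons block (U+1F600 to U+1F64F) should count as emoji, which is far narrower than the full emoji set; the class name EmojiCheck and its methods are hypothetical. It needs Java 8 or later for String.codePoints().

```java
final class EmojiCheck {
    // Assumption: treat only the Emoticons block (U+1F600 to U+1F64F) as "emoji".
    static boolean isEmoticon(int codePoint) {
        return codePoint >= 0x1F600 && codePoint <= 0x1F64F;
    }

    // True if any code point of the string falls inside the assumed range.
    static boolean containsEmoji(String s) {
        return s.codePoints().anyMatch(EmojiCheck::isEmoticon);
    }

    public static void main(String[] args) {
        System.out.println(containsEmoji("1\uD83D\uDE0134")); // contains U+1F601
        System.out.println(containsEmoji("No smiley here"));
    }
}
```

Writing the emoji as escapes or hex literals also sidesteps the question of how the source file itself is encoded.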