I have a UITextView and I need to detect if a user enters an emoji character.
I would think that just checking the unicode value of the newest character would suffice.
First let's address your "55357 method" – and why it works for many emoji characters.
In Cocoa, an NSString is a collection of unichars, and unichar is just a typealias for unsigned short, which is the same as UInt16. Since the maximum value of UInt16 is 0xffff, this rules out quite a few emoji from being able to fit into one unichar, as only two of the six main Unicode blocks used for emoji fall within this range:

- Miscellaneous Symbols (U+2600–U+26FF)
- Dingbats (U+2700–U+27BF)
These blocks contain 113 emoji, and an additional 66 emoji that can be represented as a single unichar can be found spread around various other blocks. However, these 179 characters only represent a fraction of the 1126 emoji base characters, the rest of which must be represented by more than one unichar.
Let's analyse your code:
unichar unicodevalue = [text characterAtIndex:0];
What's happening is that you're simply taking the first unichar of the string, and while this works for the previously mentioned 179 characters, it breaks apart when you encounter a UTF-32 character, since NSString converts everything into UTF-16 encoding. The conversion works by substituting the UTF-32 value with surrogate pairs, which means that the NSString now contains two unichars.
And now we're getting to why the number 55357, or 0xd83d, appears for many emoji: when you only look at the first UTF-16 value of a UTF-32 character you get the high surrogate, each of which has a span of 1024 low surrogates. The range for the high surrogate 0xd83d is U+1F400–U+1F7FF, which starts in the middle of the largest emoji block, Miscellaneous Symbols and Pictographs (U+1F300–U+1F5FF), and continues all the way up to Geometric Shapes Extended (U+1F780–U+1F7FF) – a range containing a total of 563 emoji and 333 non-emoji characters.
So, an impressive 50% of emoji base characters have the high surrogate 0xd83d, but these deduction methods still leave 384 emoji characters unhandled, along with giving false positives for at least as many.
I recently answered a somewhat related question with a Swift implementation, and if you want to, you can look at how emoji are detected in this framework, which I created for the purpose of replacing standard emoji with custom images.
Anyhow, what you can do is extract the UTF-32 code point from the characters, which we'll do according to the specification:
- (BOOL)textView:(UITextView *)textView shouldChangeTextInRange:(NSRange)range replacementText:(NSString *)text {
    // Get the UTF-16 representation of the text.
    NSUInteger length = text.length;
    unichar buffer[length];
    [text getCharacters:buffer];

    // Initialize array to hold our UTF-32 values.
    NSMutableArray *array = [[NSMutableArray alloc] init];

    // Temporary stores for the UTF-32 and UTF-16 values.
    UTF32Char utf32 = 0;
    UTF16Char h16 = 0, l16 = 0;

    for (NSUInteger i = 0; i < length; i++) {
        unichar surrogate = buffer[i];

        // High surrogate (0xd800–0xdbff).
        if (0xd800 <= surrogate && surrogate <= 0xdbff) {
            h16 = surrogate;
            continue;
        }
        // Low surrogate (0xdc00–0xdfff).
        else if (0xdc00 <= surrogate && surrogate <= 0xdfff) {
            l16 = surrogate;
            // Convert surrogate pair to UTF-32 encoding.
            utf32 = ((h16 - 0xd800) << 10) + (l16 - 0xdc00) + 0x10000;
        }
        // Normal UTF-16.
        else {
            utf32 = surrogate;
        }

        // Add UTF-32 value to array.
        [array addObject:[NSNumber numberWithUnsignedInteger:utf32]];
    }

    NSLog(@"%@ contains values:", text);
    for (NSUInteger i = 0; i < array.count; i++) {
        UTF32Char character = (UTF32Char)[[array objectAtIndex:i] unsignedIntegerValue];
        NSLog(@"\t- U+%x", character);
    }

    return YES;
}
Typing "