Question
In my app, I have characters that are followed by their "modifier diacritical marks" (e.g. "oˆ", where the "ˆ" is Unicode U+02C6) that I want to convert into fully precomposed characters (e.g. "ô", Unicode U+00F4). I tried using the NSString method precomposedStringWithCanonicalMapping, but after several hours of beating my head against the wall trying to figure out why it wasn't working, I discovered that it only converts "combining diacritical marks" (http://www.unicode.org/charts/PDF/U0300.pdf) into precomposed characters.
OK, so all I need to do is convert all of my "modifier diacritical marks" into "combining diacritical marks", then call precomposedStringWithCanonicalMapping on the resulting string, and I'm done. This does work, but I wonder if there's a less tedious and less error-prone way to do it. Here's my NSString category method that seems to fix most of the characters:
- (NSString *)combineDiacritics
{
    // Maps the unichar of a spacing diacritic to the unichar of its combining counterpart.
    static NSDictionary<NSNumber *, NSNumber *> *sDiacriticalSubstDict;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        // http://www.unicode.org/charts/PDF/U0300.pdf
        sDiacriticalSubstDict = @{
            @(0x02cb) : @(0x0300), @(0x00b4) : @(0x0301), @(0x02c6) : @(0x0302), @(0x02dc) : @(0x0303), @(0x02c9) : @(0x0304), // Grave, Acute, Circumflex, Tilde, Macron
            @(0x00af) : @(0x0305), @(0x02d8) : @(0x0306), @(0x02d9) : @(0x0307), @(0x00a8) : @(0x0308), @(0x02c0) : @(0x0309), // Overline, Breve, Dot above, Diaeresis, Hook above
            @(0x00b0) : @(0x030a), @(0x02da) : @(0x030a), @(0x02dd) : @(0x030b), @(0x02c7) : @(0x030c), @(0x02c8) : @(0x030d), @(0x02bb) : @(0x0312), // Ring above (degree sign and U+02DA), Double acute, Caron, Vertical line above, Cedilla above
            @(0x02bc) : @(0x0313), @(0x02bd) : @(0x0314), @(0x02b2) : @(0x0321), @(0x02d4) : @(0x0323), @(0x02b1) : @(0x0324), // Comma above, Reversed comma above, Palatalized hook below, Dot below, Diaeresis below
            @(0x00b8) : @(0x0327), @(0x02db) : @(0x0328), @(0x02cc) : @(0x0329), @(0x02b7) : @(0x032b), @(0x02cd) : @(0x0331), // Cedilla, Ogonek, Vertical line below, Inverted double arch below, Macron below
        };
    });
    NSMutableString *buffer = [NSMutableString stringWithCapacity:self.length];
    [self enumerateSubstringsInRange:NSMakeRange(0, self.length) options:NSStringEnumerationByComposedCharacterSequences usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
        NSString *newString = nil;
        if (substring.length == 1) // The diacriticals are all in the Unicode BMP.
        {
            unichar uniChar = [substring characterAtIndex:0];
            // A missing key yields nil, and messaging nil returns 0.
            unichar newUniChar = [sDiacriticalSubstDict[@(uniChar)] unsignedShortValue];
            if (newUniChar != 0)
            {
                NSLog(@"Unichar %04x => %04x", uniChar, newUniChar);
                newString = [NSString stringWithCharacters:&newUniChar length:1];
            }
        }
        if (newString)
            [buffer appendString:newString];
        else
            [buffer appendString:substring];
    }];
    NSString *precomposedStr = [buffer precomposedStringWithCanonicalMapping];
    return precomposedStr;
}
Does anyone know of a more built-in way to make this conversion?
Answer 1:
There is no built-in way to do this conversion because characters in the Spacing Modifier Letters block (U+02B0..U+02FF) are not intended to be used as diacritical marks. From Section 7.8 of the Unicode Standard:
They are not formally combining marks (gc=Mn or gc=Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right.
Spacing Clones of Diacritics. Some corporate standards explicitly specify spacing and nonspacing forms of combining diacritical marks, and the Unicode Standard provides matching codes for these interpretations when practical.
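The distinction drawn in the quoted passage can be checked from Foundation. The following is a minimal sketch, not part of the original answer: IsCombiningMark and ComposedLength are hypothetical helper names, and the sketch relies on NSCharacterSet's nonBaseCharacterSet covering the combining marks (general category M*).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if `c` is a combining mark (general category M*).
static BOOL IsCombiningMark(unichar c)
{
    return [[NSCharacterSet nonBaseCharacterSet] characterIsMember:c];
}

// Hypothetical helper: normalizes the two-character sequence base + mark to NFC
// and returns the length of the result in UTF-16 code units.
static NSUInteger ComposedLength(unichar base, unichar mark)
{
    const unichar chars[] = { base, mark };
    NSString *sequence = [NSString stringWithCharacters:chars length:2];
    return sequence.precomposedStringWithCanonicalMapping.length;
}
```

With these helpers, U+02C6 (the spacing modifier letter) should not be reported as a combining mark and "o" + U+02C6 should keep a length of 2 after normalization, whereas U+0302 should be reported as a combining mark and "o" + U+0302 should collapse to a single code unit.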
If you want to convert them to the combining forms, you will need to build a table (as you are already doing) from the cross references in the Spacing Modifier Letters code chart.
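As one possible shape for such a table, here is a compact sketch that keys the mapping by string rather than by unichar and lets NSMutableString do the substitution. ComposeSpacingDiacritics is a hypothetical function name, and the table is deliberately partial (four entries chosen for illustration); a real table would be filled in from the code chart cross references.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not a Foundation API: replaces a few spacing diacritics
// with their combining counterparts, then normalizes the result to NFC.
static NSString *ComposeSpacingDiacritics(NSString *input)
{
    // Partial, illustrative table: spacing form -> combining form.
    NSDictionary<NSString *, NSString *> *table = @{
        @"\u02cb" : @"\u0300", // grave
        @"\u02c6" : @"\u0302", // circumflex
        @"\u02dc" : @"\u0303", // tilde
        @"\u02c7" : @"\u030c", // caron
    };
    NSMutableString *buffer = [input mutableCopy];
    for (NSString *spacing in table) {
        // NSLiteralSearch matches exact code units, so nothing else is touched.
        [buffer replaceOccurrencesOfString:spacing
                                withString:table[spacing]
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, buffer.length)];
    }
    return buffer.precomposedStringWithCanonicalMapping;
}
```

For example, ComposeSpacingDiacritics(@"o\u02c6") should return "ô" (U+00F4). Any table of this kind replaces every occurrence of a listed character, so entries that also have a standalone meaning, such as U+00B0 (the degree sign), need care: "25°" would have its degree sign turned into a combining ring on the "5".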
Source: https://stackoverflow.com/questions/35952216/how-to-convert-to-combining-diacritical-marks-on-ios