I am trying hard to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am stuck in a big way.
Here I am trying to get the length
This turns out to be really ugly. I have debugged your string and it contains the following characters (with their hex code points):
க 0x0b95
ு 0x0bc1
ம 0x0bae
ா 0x0bbe
ர 0x0bb0
் 0x0bcd
So Tamil obviously uses diacritic-like combining sequences to build its characters, and each combining sign unfortunately counts as a separate entity.
This is not a problem with UTF-8 / UTF-16, as erroneously claimed by other answers; it is inherent in how Unicode encodes the Tamil script.
The suggested Normalizer does not work. It seems that Tamil has been designed by Unicode "experts" to explicitly use combining sequences which cannot be normalized into single code points. Aargh.
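To make that concrete, here is a minimal sketch that checks the three lengths for the string from the question. The class name `NormalizeCheck` and the helper `nfcLength` are made up for illustration; only standard `java.text.Normalizer` and `String` methods are used.

```java
import java.text.Normalizer;

class NormalizeCheck {
    // Hypothetical helper: number of UTF-16 code units after NFC normalization.
    static int nfcLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        // குமார் written with escapes so the source file encoding does not matter
        String kumar = "\u0b95\u0bc1\u0bae\u0bbe\u0bb0\u0bcd";
        System.out.println(kumar.length());                          // 6
        System.out.println(kumar.codePointCount(0, kumar.length())); // 6
        System.out.println(nfcLength(kumar));                        // 6
    }
}
```

All three numbers are 6 because every character is in the Basic Multilingual Plane and none of these base-plus-sign pairs has a precomposed code point for NFC to compose into.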
My next idea is not to count characters, but glyphs, the visual representations of characters.
String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);
Font display = new Font("SansSerif", Font.PLAIN, 12);
GlyphVector vec = display.createGlyphVector(new FontRenderContext(new AffineTransform(), false, false), str1);
System.out.println(vec.getNumGlyphs());
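// The loop was cut off in the original post; this body is a reconstruction inferred from the output below.
for (int i = 0; i < vec.getNumGlyphs(); i++) {
    char c = str1.charAt(vec.getGlyphCharIndex(i));
    System.out.println(c + " " + Integer.toHexString(c) + " " + vec.getGlyphVisualBounds(i).getBounds2D());
}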
The result:
க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]
The glyph vector still contains six glyphs, and the glyphs overlap, so counting glyphs does not help either. You need to use Java's character type functions, as in the other solution.
SOLUTION:
I am using this link: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf
public static int getTamilStringLength(String tamil) {
    // Number of dependent (combining) signs that attach to a preceding letter.
    int dependentCharacterLength = 0;
    for (int index = 0; index < tamil.length(); index++) {
        char code = tamil.charAt(index);
        if (code == 0x0B82)                          // anusvara sign
            dependentCharacterLength++;
        else if (code >= 0x0BBE && code <= 0x0BC8)   // vowel signs
            dependentCharacterLength++;
        else if (code >= 0x0BCA && code <= 0x0BD7)   // vowel signs, virama (0x0BCD), au length mark (0x0BD7)
            dependentCharacterLength++;
    }
    // Every remaining code unit is counted as one letter.
    return tamil.length() - dependentCharacterLength;
}
You need to exclude the combining characters from the count.
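As an alternative to hard-coding the Tamil ranges, the same idea can be sketched with Java's character type functions, skipping every code point whose general category is a combining mark. This is not the code from the linked PDF; the class name `TamilLetterCount` and the method `countLetters` are made up for illustration.

```java
class TamilLetterCount {
    // Counts code points that are NOT combining marks (Unicode categories Mn, Mc, Me).
    static int countLetters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                count++;
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return count;
    }

    public static void main(String[] args) {
        String kumar = "\u0b95\u0bc1\u0bae\u0bbe\u0bb0\u0bcd"; // குமார்
        System.out.println(countLetters(kumar)); // 3
    }
}
```

Note that this counts every non-mark code point, so spaces, digits and punctuation are included, and a conjunct written as consonant + virama + consonant counts as two. It also differs slightly from the hard-coded ranges: U+0BD0 (TAMIL OM) is a letter by category and is counted here, while the range 0x0BCA to 0x0BD7 above treats it as dependent.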