I am trying hard to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am stuck in a big way.
Here I am trying to get the length
Have a look at the Normalizer class; it explains what may be the cause of your problem. In Unicode, the same character can be encoded in several ways, e.g. Á can be either:
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
You can try using Normalizer to convert your string to the composed form (NFC) and then iterate over the characters.
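To make the Á example concrete, here is a minimal sketch of what normalization does to the two encodings. The class name NormalizationDemo and the helper composedLength are made up for illustration; only java.text.Normalizer is a real API.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original answer.
class NormalizationDemo {
    // Length in UTF-16 code units after converting to the composed form (NFC).
    static int composedLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        String precomposed = "\u00C1"; // U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // U+0041 followed by U+0301 COMBINING ACUTE ACCENT

        System.out.println(precomposed.length());           // 1
        System.out.println(decomposed.length());            // 2
        System.out.println(precomposed.equals(decomposed)); // false
        System.out.println(composedLength(decomposed));     // 1
    }
}
```

Normalization only shortens a string where a precomposed code point exists. Tamil syllables such as கு have none, so for your string you still need to group the combining marks with their base letter yourself.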
Edit: Based on the article suggested by @halex above, try this in Java:
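import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.ArrayList;
import java.util.List;

public class TamilCharacters {
    public static void main(String[] args) {
        String str = "குமார்";
        List<String> characters = new ArrayList<>();
        // Compose base letters and marks wherever a precomposed form exists.
        str = Normalizer.normalize(str, Form.NFC);
        StringBuilder charBuffer = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int codePoint = str.codePointAt(i);
            int category = Character.getType(codePoint);
            // Anything that is not a mark (or one of the other listed categories)
            // starts a new character, so flush what has been collected so far.
            if (charBuffer.length() > 0
                    && category != Character.NON_SPACING_MARK
                    && category != Character.COMBINING_SPACING_MARK
                    && category != Character.CONTROL
                    && category != Character.OTHER_SYMBOL) {
                characters.add(charBuffer.toString());
                charBuffer.setLength(0);
            }
            charBuffer.appendCodePoint(codePoint);
            // Advance by one or two chars so a surrogate pair is not visited twice.
            i += Character.charCount(codePoint);
        }
        // Flush the last collected character.
        if (charBuffer.length() > 0) {
            characters.add(charBuffer.toString());
        }
        System.out.println(characters);
    }
}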
The result I get is [கு, மா, ர்], so characters.size() gives you the count. If it doesn't work for all your strings, try fiddling with other Unicode character categories in the if block.
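As an alternative to checking categories by hand, the standard library's java.text.BreakIterator can report character boundaries for you. This is a different technique from the loop above, shown only as a sketch: the class name GraphemeCount and the method count are made up, and how Tamil clusters are split may depend on your JDK version, so check the output against the strings you care about.

```java
import java.text.BreakIterator;

// Hypothetical helper class, not part of the original answer.
class GraphemeCount {
    // Counts the segments between the character boundaries that BreakIterator reports.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // A + COMBINING ACUTE ACCENT is treated as one character, then b and c.
        System.out.println(count("A\u0301bc")); // 3

        // The Tamil string from the question, written with escapes to avoid
        // source-encoding problems; the result may vary between JDK versions.
        System.out.println(count("\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD"));
    }
}
```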