Java Unicode String length


孤独总比滥情好 2020-12-13 03:41

I am trying to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am badly stuck.

Here I am trying to get the length

5 answers

一生所求 (original poster)

2020-12-13 04:19

Have a look at the Normalizer class; its documentation explains what may be causing your problem. In Unicode, the same character can be encoded in several ways, e.g. Á is either one code point or two:

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT

You can try to use Normalizer to convert your string to the composed form and then iterate over the characters.
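To make the two encodings of Á concrete, here is a minimal sketch of what `String.length()` reports before and after normalization. The class name `NormalizeDemo` and the helper `compose` are made up for this illustration; only `java.text.Normalizer` comes from the JDK.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Hypothetical helper: returns the NFC (composed) form of the input.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C1"; // U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // U+0041 followed by U+0301 COMBINING ACUTE ACCENT

        System.out.println(precomposed.length());                    // 1
        System.out.println(decomposed.length());                     // 2
        System.out.println(compose(decomposed).length());            // 1
        System.out.println(compose(decomposed).equals(precomposed)); // true
    }
}
```

Normalization only helps where a precomposed code point exists. Scripts such as Tamil have no precomposed form for most consonant and vowel sign combinations, so their strings keep the same length after NFC.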

Edit: Based on the article suggested by @halex above, try this in Java:

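    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.ArrayList;
    import java.util.List;

    String str = "குமார்";

    List<String> characters = new ArrayList<>();
    // Compose first so that base + accent pairs collapse where a precomposed form exists.
    str = Normalizer.normalize(str, Form.NFC);
    StringBuilder charBuffer = new StringBuilder();
    for (int i = 0; i < str.length(); ) {
        int codePoint = str.codePointAt(i);
        // Advance by the code point's width so a surrogate pair is read only once.
        i += Character.charCount(codePoint);
        int category = Character.getType(codePoint);
        // Anything that is not a mark, control, or other symbol starts a new group.
        if (charBuffer.length() > 0
                && category != Character.NON_SPACING_MARK
                && category != Character.COMBINING_SPACING_MARK
                && category != Character.CONTROL
                && category != Character.OTHER_SYMBOL) {
            characters.add(charBuffer.toString());
            charBuffer.setLength(0);
        }
        charBuffer.appendCodePoint(codePoint);
    }
    // Flush the last group.
    if (charBuffer.length() > 0) {
        characters.add(charBuffer.toString());
    }
    System.out.println(characters);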

The result I get is [கு, மா, ர்], i.e. three user-visible characters. If it doesn't work for all your strings, try fiddling with other Unicode character categories in the if block.
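If only the count is needed, the same idea can be reduced to counting the code points that are not combining marks. This is a simplified sketch, not the code from this answer: the class `ClusterCount` and the method `countGroups` are hypothetical names, it ignores the CONTROL and OTHER_SYMBOL categories, and a mark at the very start of the string is not counted.

```java
class ClusterCount {
    // Hypothetical helper: counts code points that are not combining marks,
    // i.e. the base characters that each start a visible group.
    static int countGroups(String s) {
        return (int) s.codePoints()
                .filter(cp -> {
                    int type = Character.getType(cp);
                    return type != Character.NON_SPACING_MARK
                            && type != Character.COMBINING_SPACING_MARK
                            && type != Character.ENCLOSING_MARK;
                })
                .count();
    }

    public static void main(String[] args) {
        // The Tamil string from the answer, written as escapes to avoid source-encoding issues.
        String str = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";

        System.out.println(str.length());                         // 6 UTF-16 chars
        System.out.println(str.codePointCount(0, str.length()));  // 6 code points
        System.out.println(countGroups(str));                     // 3 base characters
    }
}
```

Neither `length()` nor `codePointCount` gives 3 here, because every vowel sign and the virama is its own code point; only grouping by category does.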
