Java Unicode String length

孤独总比滥情好 2020-12-13 03:41

I am trying to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am badly stuck on it.

Here I am trying to get the length

5 answers
  •  一生所求
    2020-12-13 04:19

    Have a look at the java.text.Normalizer class. Its documentation explains what is probably causing your problem: in Unicode, the same character can be encoded in several ways, e.g. Á can be either

      U+00C1    LATIN CAPITAL LETTER A WITH ACUTE
    

    or

      U+0041    LATIN CAPITAL LETTER A
      U+0301    COMBINING ACUTE ACCENT
    

    You can try to use Normalizer to convert your string to the composed form and then iterate over the characters.
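
    To see why normalization matters for length, here is a minimal sketch using the Á example above. The class name NormalizationDemo and the helper lengthAfter are made up for illustration; Normalizer.normalize and Normalizer.Form are the standard java.text API.

```java
import java.text.Normalizer;

class NormalizationDemo {
    // Number of UTF-16 code units in s after normalizing it to the given form.
    static int lengthAfter(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).length();
    }

    public static void main(String[] args) {
        String precomposed = "\u00C1";      // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "\u0041\u0301"; // LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

        // Both render as the same character but differ in length and equality.
        System.out.println(precomposed.length());           // 1
        System.out.println(decomposed.length());            // 2
        System.out.println(precomposed.equals(decomposed)); // false

        // NFC composes the pair into one code unit; NFD splits it into two.
        System.out.println(lengthAfter(decomposed, Normalizer.Form.NFC));  // 1
        System.out.println(lengthAfter(precomposed, Normalizer.Form.NFD)); // 2
    }
}
```

    Note that NFC only helps where a precomposed code point exists. Many scripts have no precomposed forms for their letter-plus-mark combinations, so a string can still be longer than the number of characters you see.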


    Edit: Based on the article suggested by @halex above, try this in Java:

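        import java.text.Normalizer;
        import java.util.ArrayList;
        import java.util.List;

        public class UnicodeLength {
            public static void main(String[] args) {
                String str = "குமார்";

                List<String> characters = new ArrayList<>();
                // Compose base letters and marks wherever a precomposed form exists.
                str = Normalizer.normalize(str, Normalizer.Form.NFC);
                StringBuilder charBuffer = new StringBuilder();
                // Advance by code point, not by char, so surrogate pairs are not split.
                for (int i = 0; i < str.length(); ) {
                    int codePoint = str.codePointAt(i);
                    int category = Character.getType(codePoint);
                    // Any code point outside these categories starts a new visible character.
                    if (charBuffer.length() > 0
                            && category != Character.NON_SPACING_MARK
                            && category != Character.COMBINING_SPACING_MARK
                            && category != Character.CONTROL
                            && category != Character.OTHER_SYMBOL) {
                        characters.add(charBuffer.toString());
                        charBuffer.setLength(0);
                    }
                    charBuffer.appendCodePoint(codePoint);
                    i += Character.charCount(codePoint);
                }
                // Flush the last buffered character.
                if (charBuffer.length() > 0) {
                    characters.add(charBuffer.toString());
                }
                System.out.println(characters);
            }
        }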
    

    The result I get is [கு, மா, ர்], so characters.size() gives the length you are after (3 here). If it doesn't work for all your strings, try fiddling with other Unicode character categories in the if block.
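
    When deciding which categories to add to or remove from that if block, it can help to print the category of every code point in a string that is being split wrongly. The following is only a rough diagnostic sketch: the class CategoryInspector and its methods are hypothetical names, and label covers just a handful of the Character type constants.

```java
class CategoryInspector {
    // Readable name for a few Character.getType results; others fall back to the number.
    static String label(int type) {
        switch (type) {
            case Character.UPPERCASE_LETTER:       return "UPPERCASE_LETTER";
            case Character.LOWERCASE_LETTER:       return "LOWERCASE_LETTER";
            case Character.OTHER_LETTER:           return "OTHER_LETTER";
            case Character.NON_SPACING_MARK:       return "NON_SPACING_MARK";
            case Character.COMBINING_SPACING_MARK: return "COMBINING_SPACING_MARK";
            case Character.ENCLOSING_MARK:         return "ENCLOSING_MARK";
            case Character.FORMAT:                 return "FORMAT";
            case Character.CONTROL:                return "CONTROL";
            case Character.OTHER_SYMBOL:           return "OTHER_SYMBOL";
            default:                               return "type " + type;
        }
    }

    // One line per code point: its U+XXXX value followed by its category.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.append(String.format("U+%04X %s\n", cp, label(Character.getType(cp))));
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("குமார்"));
    }
}
```

    Any code point whose category should stick to the preceding character, but is not yet listed in the if block, is a candidate for another `category != ...` condition.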
