Java Unicode String length

孤独总比滥情好 2020-12-13 03:41

I am trying hard to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am stuck in a big way.

Here I am trying to get the length

5 answers
  •  旧巷少年郎
    2020-12-13 04:17

    This turns out to be really ugly. I debugged your string, and it contains the following characters (with their hex code points):

    க 0x0b95
    ு 0x0bc1
    ம 0x0bae
    ா 0x0bbe
    ர 0x0bb0
    ் 0x0bcd

    So the Tamil script evidently uses diacritic-like combining sequences to form its letters, and unfortunately each element of a sequence counts as a separate character.
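To reproduce this kind of listing, a minimal sketch along the following lines may help. The class and method names are made up for illustration, and the word is written with Unicode escapes so that the source file encoding does not matter. It labels each UTF-16 code unit as a base character or a combining mark, based on its Unicode general category.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class CodeUnitDump {
    // Returns one line per UTF-16 code unit: hex value plus a base/mark label.
    static List<String> describe(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int type = Character.getType(c);
            boolean mark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            lines.add(String.format("0x%04x %s", (int) c,
                    mark ? "combining mark" : "base character"));
        }
        return lines;
    }

    public static void main(String[] args) {
        // The sample word from the question, written as Unicode escapes.
        String word = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        describe(word).forEach(System.out::println);
    }
}
```

For the sample word, every second code unit should come out labelled as a combining mark.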

    This is not a problem with UTF-8 / UTF-16, as erroneously claimed by other answers; it is inherent in how Unicode encodes the Tamil script.

    The suggested Normalizer does not help: it seems that Tamil has been designed by Unicode "experts" to explicitly use combining sequences that cannot be normalized into single code points. Aargh.
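A quick way to check the normalization claim is to compare the length before and after NFC. The sketch below uses a made-up class name and assumes only the standard `java.text.Normalizer`. None of the pairs in this word has a precomposed code point, so the length should stay at 6.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the original answer.
class NormalizeCheck {
    // Length in UTF-16 code units after NFC normalization.
    static int nfcLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    public static void main(String[] args) {
        // The sample word from the question, written as Unicode escapes.
        String word = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println("raw length: " + word.length());
        System.out.println("NFC length: " + nfcLength(word));
    }
}
```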

    My next idea is not to count characters, but glyphs, the visual representations of characters.

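    // Imports: java.awt.Font, java.awt.font.*, java.awt.geom.*, java.text.Normalizer
    String str1 = Normalizer.normalize("குமார்", Normalizer.Form.NFC);

    Font display = new Font("SansSerif", Font.PLAIN, 12);
    FontRenderContext frc = new FontRenderContext(new AffineTransform(), false, false);
    GlyphVector vec = display.createGlyphVector(frc, str1);

    System.out.println(vec.getNumGlyphs());
    // The loop body was cut off in the post; this is a reconstruction inferred from the output.
    for (int i = 0; i < vec.getNumGlyphs(); i++) {
        char c = str1.charAt(vec.getGlyphCharIndex(i));
        Rectangle2D b = vec.getGlyphVisualBounds(i).getBounds2D();
        System.out.println(c + " " + Integer.toHexString(c) + " [x=" + b.getX() + ",y=" + b.getY() + ",w=" + b.getWidth() + ",h=" + b.getHeight() + "]");
    }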

    The result:

    க b95 [x=0.0,y=-6.0,w=7.0,h=6.0]
    ு bc1 [x=8.0,y=-6.0,w=7.0,h=4.0]
    ம bae [x=17.0,y=-6.0,w=6.0,h=6.0]
    ா bbe [x=23.0,y=-6.0,w=5.0,h=6.0]
    ர bb0 [x=30.0,y=-6.0,w=4.0,h=8.0]
    ் bcd [x=31.0,y=-9.0,w=1.0,h=2.0]

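    As the glyph bounds overlap (the virama ் at x=31 sits inside ர, which spans x=30 to 34), counting glyphs does not help either: there are still six of them. You need to use Java's character-type functions, as in the other solution.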
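The other solution is not reproduced here, so the following is only a sketch of what a count based on character types might look like. It assumes that every code point in one of the Unicode mark categories (`Mn`, `Mc`, `Me`) attaches to the preceding base character and should not be counted. The class and method names are invented for illustration.

```java
// Hypothetical helper, not part of the original answer.
class MarkAwareLength {
    // Counts code points that are not combining marks.
    static int countBaseCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK
                    && type != Character.COMBINING_SPACING_MARK
                    && type != Character.ENCLOSING_MARK) {
                count++;
            }
            // Advance by one or two code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // The sample word from the question, written as Unicode escapes.
        String word = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println(countBaseCharacters(word)); // 3
    }
}
```

Unlike a hard-coded range check, this is not specific to Tamil, but it is still an approximation: it knows nothing about joiners or other script-specific cluster rules.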

    SOLUTION:

    I am using the approach described in this document: http://www.venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf

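    public static int getTamilStringLength(String tamil) {
        // Count the dependent (combining) code units; each attaches to a base letter.
        int dependentCharacterLength = 0;
        for (int index = 0; index < tamil.length(); index++) {
            char code = tamil.charAt(index);
            if (code == 0x0B82)                          // anusvara
                dependentCharacterLength++;
            else if (code >= 0x0BBE && code <= 0x0BC8)   // vowel signs
                dependentCharacterLength++;
            else if (code >= 0x0BCA && code <= 0x0BCD)   // vowel signs and virama
                dependentCharacterLength++;
            else if (code == 0x0BD7)                     // AU length mark
                dependentCharacterLength++;
            // U+0BD0 (TAMIL OM) is an independent letter, so it is not counted here.
        }
        // Visible letters = all code units minus the dependent ones.
        return tamil.length() - dependentCharacterLength;
    }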
    

    In short, you need to identify the combining characters and exclude them from the count, so that only the base letters are counted.
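As a script-independent alternative, Java 9 and later support the regex construct `\X`, which matches one Unicode extended grapheme cluster. The sketch below uses an invented class name and assumes a Java 9+ runtime. It counts clusters rather than filtering code points by range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer. Requires Java 9 or later.
class GraphemeCount {
    // \X matches a single extended grapheme cluster.
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    static int count(String s) {
        Matcher m = CLUSTER.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // The sample word from the question, written as Unicode escapes.
        String word = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println(count(word)); // 3
    }
}
```

For the sample word this agrees with the range-based count. The two can differ for other input, since the cluster rules cover more than Tamil dependent signs.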
