Java Unicode String length

孤独总比滥情好 2020-12-13 03:41

I am trying to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am badly stuck.

Here I am trying to get the length

5 answers
  •  情话喂你
    2020-12-13 04:19

    Found a solution to your problem.

    Based on this SO answer I made a program that uses regex character classes to search for letters that may have optional modifiers. It splits your string into single (combined if necessary) characters and puts them into a list:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class Main
    {
        public static void main (String[] args)
        {
            String s="குமார்";
            List<String> characters = new ArrayList<>();
            // One Unicode letter followed by any number of combining marks
            Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
            Matcher matcher = pat.matcher(s);
            while (matcher.find()) {
                characters.add(matcher.group());
            }

            // Test if we have the right characters and length
            System.out.println(characters);
            System.out.println("String length: " + characters.size());
        }
    }
    

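    Here \\p{L} matches any Unicode letter and \\p{M} matches any Unicode mark (a combining character such as a vowel sign), so each match is one letter together with the marks attached to it.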

    The output of the snippet is:

    [கு, மா, ர்]
    String length: 3
    

    See https://ideone.com/Apkapn for a working Demo
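
For comparison, here is a minimal sketch of why the built-in counts do not give the expected answer. It puts String.length() (UTF-16 code units) and String.codePointCount() next to the regex count from the snippet above. The class name LengthComparison and the helper countLetters are made up for this illustration, and the sample word is written with Unicode escapes only so that the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthComparison {
    // Same pattern as in the answer: one letter followed by any number of marks.
    private static final Pattern LETTER_WITH_MARKS = Pattern.compile("\\p{L}\\p{M}*");

    // Hypothetical helper: counts the pattern matches in s.
    static int countLetters(String s) {
        Matcher m = LETTER_WITH_MARKS.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The word from the answer: U+0B95 U+0BC1 U+0BAE U+0BBE U+0BB0 U+0BCD
        String s = "\u0B95\u0BC1\u0BAE\u0BBE\u0BB0\u0BCD";
        System.out.println("length():         " + s.length());                     // 6
        System.out.println("codePointCount(): " + s.codePointCount(0, s.length())); // 6
        System.out.println("regex matches:    " + countLetters(s));                // 3
    }
}
```

Both built-in counts are 6 because every vowel sign and the pulli are separate code points in the Basic Multilingual Plane. Only the regex groups them with their base letter.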


    EDIT

    I have now checked my regex against all valid Tamil letters taken from the tables in http://en.wikipedia.org/wiki/Tamil_script. It turns out that the regex above does not capture all letters correctly: every letter in the last row of the Grantha compound table is split into two letters. I therefore refined the regex to the following:

    Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
    

    With this pattern in place of the one above, you should be able to split your sentence into every valid Tamil letter (as long as Wikipedia's table is complete).

    The code I used for checking is the following one:

    String s = "ஃஅஆஇஈஉஊஎஏஐஒஓஔக்ககாகிகீகுகூகெகேகைகொகோகௌங்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌச்சசாசிசீசுசூசெசேசைசொசோசௌஞ்ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌட்டடாடிடீடுடூடெடேடைடொடோடௌண்ணணாணிணீணுணூணெணேணைணொணோணௌத்ததாதிதீதுதூதெதேதைதொதோதௌந்நநாநிநீநுநூநெநேநைநொநோநௌப்பபாபிபீபுபூபெபேபைபொபோபௌம்மமாமிமீமுமூமெமேமைமொமோமௌய்யயாயியீயுயூயெயேயையொயோயௌர்ரராரிரீருரூரெரேரைரொரோரௌல்லலாலிலீலுலூலெலேலைலொலோலௌவ்வவாவிவீவுவூவெவேவைவொவோவௌழ்ழழாழிழீழுழூழெழேழைழொழோழௌள்ளளாளிளீளுளூளெளேளைளொளோளௌற்றறாறிறீறுறூறெறேறைறொறோறௌன்னனானினீனுனூனெனேனைனொனோனௌஶ்ஶஶாஶிஶீஶுஶூஶெஶேஶைஶொஶோஶௌஜ்ஜஜாஜிஜீஜுஜூஜெஜேஜைஜொஜோஜௌஷ்ஷஷாஷிஷீஷுஷூஷெஷேஷைஷொஷோஷௌஸ்ஸஸாஸிஸீஸுஸூஸெஸேஸைஸொஸோஸௌஹ்ஹஹாஹிஹீஹுஹூஹெஹேஹைஹொஹோஹௌக்ஷ்க்ஷக்ஷாக்ஷிக்ஷீக்ஷுக்ஷூக்ஷெக்ஷேக்ஷைஷொக்ஷோஷௌ";
    List<String> characters = new ArrayList<>();
    Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
    Matcher matcher = pat.matcher(s);
    while (matcher.find()) {
        characters.add(matcher.group());
    }
    
    System.out.println(characters);
    System.out.println(characters.size() == 325);
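
To make the effect of the refinement concrete, the following sketch runs both patterns on the Grantha compound க்ஷ (U+0B95 U+0BCD U+0BB7) on its own. The class name PatternComparison and the helper split are hypothetical names introduced for this example. The two patterns are the ones quoted in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // First version from the answer: a letter followed by any number of marks.
    static final Pattern SIMPLE = Pattern.compile("\\p{L}\\p{M}*");
    // Refined version: try the three-code-point compound first, then a letter with an optional mark.
    static final Pattern REFINED =
            Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");

    // Hypothetical helper: collects every match of the pattern in s.
    static List<String> split(Pattern pattern, String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        String kssa = "\u0B95\u0BCD\u0BB7"; // ka + pulli + ssa, written as one letter
        System.out.println(split(SIMPLE, kssa).size());  // 2: ka with pulli, then ssa
        System.out.println(split(REFINED, kssa).size()); // 1: the whole compound
    }
}
```

The order of the alternatives matters: Java tries them left to right, so the compound has to come before the generic letter branch or it would never be matched as a unit.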
    
