Count number of characters present in foreign language

孤独总比滥情好 2021-01-05 07:41

Is there any optimal way to implement character count for non-English letters? For example, if we take the word "Mother" in English, it is a 6 letter wor

2 answers
  •  我在风中等你
    2021-01-05 07:51

    You can ignore combining marks in the count calculation with this function:

    function charCount( str ) {
        // General combining-mark ranges first, then the Tamil signs and vowel marks listed by hand.
        var re = /[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f\u0b82\u0b83\u0bbe\u0bbf\u0bc0-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7]/g;
        return str.replace( re, "").length;
    }
    
    console.log(charCount('மதர்')); // 3
    
    // More tests on random Tamil text.
    // Paint the text character by character to verify; for instance, 'யெ' is a single character, not 2.
    
    console.log(charCount("மெய்யெழுத்துக்கள்")); //9
    console.log(charCount("ஒவ்வொன்றுடனும்")); //8
    console.log(charCount("தமிழ்")); //3
    console.log(charCount("வருகின்றனர்.")); //8
    console.log(charCount("எழுதப்படும்")); //7
    

    Unicode has no precomposed code points that combine a Tamil sign or mark with its target character, so normalization would not help. I added all the Tamil combining marks and signs to the regex by hand. The regex also includes the ranges for the general combining marks, so charCount("ä") is 1 regardless of normalization form.
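
    As an alternative to the hand-written character class, the same idea can be sketched with a Unicode property escape. This is a different technique from the function above, not a drop-in copy: it assumes an engine that supports the `u` flag with `\p{…}` (ES2018 or later), and the name `charCountByProperty` is made up for this illustration. One known difference: U+0B83 (ஃ) has the general category Lo (letter), so this version counts it, while the regex above strips it.

```javascript
// Sketch: count the code points that are not Unicode combining marks.
// charCountByProperty is a hypothetical name used only for this example.
function charCountByProperty(str) {
    // \p{M} matches every code point whose General_Category is Mark (Mn, Mc, Me).
    // The "u" flag is required for property escapes.
    const stripped = str.replace(/\p{M}/gu, "");
    // Spreading a string iterates by code point, so characters outside the
    // Basic Multilingual Plane are counted once instead of twice.
    return [...stripped].length;
}

console.log(charCountByProperty("மதர்"));    // 3
console.log(charCountByProperty("தமிழ்"));   // 3
console.log(charCountByProperty("a\u0308")); // 1 (decomposed "ä")
console.log(charCountByProperty("\u00e4"));  // 1 (precomposed "ä")
```

    For the test strings in this answer, both versions give the same counts. Neither is a full grapheme-cluster segmentation in the sense of the Unicode text segmentation rules; both only drop combining marks before counting.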
