Count number of characters present in foreign language

孤独总比滥情好 2021-01-05 07:41

Is there an optimal way to implement a character count for non-English letters? For example, if we take the word "Mother" in English, it is a 6 letter wor

2 answers

我在风中等你 (OP)

2021-01-05 07:51

You can ignore combining marks in the count calculation with this function:

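function charCount(str) {
    // General combining diacritical mark ranges, plus the Tamil signs, vowel signs, and virama
    var re = /[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f\u0b82\u0b83\u0bbe\u0bbf\u0bc0-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7]/g;
    // Strip the marks so that only base characters are counted
    return str.replace(re, "").length;
}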

console.log(charCount('மதர்')); // 3

//More tests on random Tamil text:
//Paint the text character by character to verify, for instance 'யெ' is a single character, not 2

console.log(charCount("மெய்யெழுத்துக்கள்")); //9
console.log(charCount("ஒவ்வொன்றுடனும்")); //8
console.log(charCount("தமிழ்")); //3
console.log(charCount("வருகின்றனர்.")); //8
console.log(charCount("எழுதப்படும்")); //7

Unicode has no precomposed code points that combine Tamil signs and vowel marks with their base letter, so normalization wouldn't help. I have added all the Tamil combining marks and signs to the regex manually. The regex also includes the ranges for general combining diacritical marks, so charCount("ä") is 1 regardless of normalization form.
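
As a possible alternative to listing the ranges by hand, engines that support ES2018 Unicode property escapes can match combining marks by general category with \p{M}. The sketch below is an assumption, not part of the answer above: the name charCountByCategory is hypothetical, and its behavior can differ from the handwritten regex (for example, U+0B83 is classified as a letter rather than a mark, so it would be counted here).

```javascript
// Sketch: count base characters by stripping every code point whose
// Unicode General_Category is Mark (\p{M}). Requires ES2018 or later.
// charCountByCategory is a hypothetical name used only for illustration.
function charCountByCategory(str) {
    // Remove combining marks (Mn, Mc, Me) from any script
    const stripped = str.replace(/\p{M}/gu, "");
    // Spreading iterates by code point, so astral characters count once
    return [...stripped].length;
}

console.log(charCountByCategory("மதர்")); // 3
console.log(charCountByCategory("தமிழ்")); // 3
console.log(charCountByCategory("a\u0308")); // 1 (decomposed "ä")
```

This trades explicit control over which signs are ignored for coverage of other scripts, so it is worth checking against text samples in the target language before relying on it.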
