Is there an optimal way to implement a character count for non-English letters? For example, if we take the word "Mother" in English, it is a 6 letter wor
You can ignore combining marks in the count calculation with this function:
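function charCount(str) {
  // General combining-mark ranges first, then the Tamil signs and vowel marks
  var re = /[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f\u0b82\u0b83\u0bbe\u0bbf\u0bc0-\u0bc2\u0bc6-\u0bc8\u0bca-\u0bcd\u0bd7]/g;
  // Strip every mark, then count the code units that remain
  return str.replace(re, "").length;
}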
console.log(charCount('மதர்')); // 3
// More tests on random Tamil text.
// To verify, select the text one character at a time: for instance, 'யெ' is a single character, not 2.
console.log(charCount("மெய்யெழுத்துக்கள்")); //9
console.log(charCount("ஒவ்வொன்றுடனும்")); //8
console.log(charCount("தமிழ்")); //3
console.log(charCount("வருகின்றனர்.")); //8
console.log(charCount("எழுதப்படும்")); //7
Unicode has no precomposed code points that combine a Tamil sign or vowel mark with its base letter, so normalization wouldn't help here. I added all the Tamil combining marks and signs to the regex manually. It also includes the ranges for the general combining marks, so charCount("ä") is 1 regardless of normalization form.
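As a possible alternative to listing the ranges by hand, the same idea can be sketched with a Unicode property escape. This assumes a runtime that supports `\p{...}` with the `u` flag (ES2018 or later); `charCountByMarks` is a hypothetical name used only for this illustration.

```javascript
// Sketch only: assumes ES2018 Unicode property escapes are available.
// charCountByMarks is a hypothetical helper, not part of the answer above.
function charCountByMarks(str) {
  // \p{M} matches any code point in the Mark general category (Mn, Mc, Me)
  return str.replace(/\p{M}/gu, "").length;
}

console.log(charCountByMarks("மதர்"));     // 3
console.log(charCountByMarks("தமிழ்"));    // 3
console.log(charCountByMarks("a\u0308")); // 1 (decomposed "ä")
console.log(charCountByMarks("\u00e4"));  // 1 (precomposed "ä")
```

Like the hand-written regex, this counts UTF-16 code units, so a character outside the Basic Multilingual Plane would count as 2. The two versions are not guaranteed to agree on every code point (for example, U+0B83 is, as far as I know, classified as a letter rather than a mark), so it is worth testing against your own text.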