Get grapheme character count in JavaScript strings?

Submitted by 只谈情不闲聊 on 2019-12-07 11:30:22

Question


I'm trying to get the length of a JavaScript string in user-visible graphemes, i.e. ignoring combining characters (and surrogate pairs?). Is this possible, and if so, how would I go about it?

We're using the Dojo Toolkit on our project, but any general JavaScript solution would be great.


Answer 1:


For the combining characters, look at the Unicode Derived Combining Class data, which lists all combining characters (among others). Since you're only interested in counting, you could simply strip them out, which leaves you with a slightly closer estimate.
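A minimal sketch of the strip-and-count idea, assuming an engine that supports Unicode property escapes (ES2018). It uses the regex class `\p{M}` (general category Mark) as a stand-in for the Derived Combining Class list, which is an assumption and not exactly the same set. The function name is hypothetical, and the result is only an estimate.

```javascript
// Hypothetical helper: estimate the visible length by removing combining marks.
// Assumption: \p{M} (general category "Mark") approximates the set of
// combining characters; it is not identical to the Derived Combining Class data.
function countWithoutCombiningMarks(str) {
  const stripped = str.replace(/\p{M}/gu, "");
  // Spreading a string iterates by code point, so a surrogate pair counts once.
  return [...stripped].length;
}

console.log(countWithoutCombiningMarks("e\u0301")); // 1 ("e" + combining acute accent)
console.log(countWithoutCombiningMarks("abc"));     // 3
```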

The post linked to by Angus, "JavaScript strings outside of the BMP", shows code to deal with surrogates. But that code actually does the contrary of what you want: it splits code points at 0x10000 and above into two 16-bit units. For counting, each such pair should be treated as a single code point. You're counting them, not displaying them...
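To illustrate the counting side, here is a small sketch (the helper name is hypothetical). `String.prototype.length` counts UTF-16 code units, so a character outside the BMP contributes 2, whereas string iteration walks by code point and counts each surrogate pair once.

```javascript
// Hypothetical helper: count code points rather than UTF-16 code units.
function countCodePoints(str) {
  let count = 0;
  // for...of on a string yields one item per code point,
  // so a surrogate pair is visited as a single item.
  for (const _codePoint of str) {
    count += 1;
  }
  return count;
}

const sample = "a\u{1D306}b"; // U+1D306 lies outside the BMP
console.log(sample.length);           // 4 (UTF-16 code units)
console.log(countCodePoints(sample)); // 3 (code points)
```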

BUT there's another category of code points you might want to deal with too: the non-printable characters. Anything under 0x20 of course, but there are plenty of others; look at the 0x2000 range for instance. These are not visible either and should not be included in your count.
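One way to approximate this, sketched under the assumption that "non-printable" means the Unicode general categories Cc (control) and Cf (format). That choice and the function name are illustrative and do not come from the answer.

```javascript
// Hypothetical helper: drop control (Cc) and format (Cf) characters, then
// count the remaining code points.
// Assumption: Cc + Cf is a reasonable proxy for "non-printable".
function countPrintable(str) {
  const visible = str.replace(/[\p{Cc}\p{Cf}]/gu, "");
  return [...visible].length;
}

// U+0007 (BEL) is a control character; U+200B (zero width space) is a format character.
console.log(countPrintable("a\u0007b\u200Bc")); // 3
```

Note that Cf also contains U+200D (zero width joiner), which emoji sequences rely on, and that the spaces U+2000 to U+200A belong to category Zs, so this sketch leaves them in the count.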




Answer 2:


Here is a pure JavaScript library that does just that:

https://github.com/orling/grapheme-splitter

It implements the Unicode UAX #29 standard, including the edge cases you're likely to miss in a home-brew solution, such as non-Latin diacritics, Hangul (Korean) jamo characters, emoji, multiple combining marks, etc.
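For comparison, here is a sketch that does not use the library at all. It assumes the runtime provides the built-in `Intl.Segmenter` (available in recent browsers and Node.js versions), which also segments text into grapheme clusters. The function name and the default locale are assumptions made for illustration.

```javascript
// Hypothetical helper built on the standard Intl.Segmenter API.
// Assumption: the runtime ships Intl.Segmenter with grapheme granularity.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster.
  for (const _segment of segmenter.segment(str)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("e\u0301\u0323"));      // 1 (base letter + two combining marks)
console.log(countGraphemes("\u1112\u1161\u11AB")); // 1 (Hangul jamo forming one syllable)
console.log(countGraphemes("abc"));                // 3
```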




Answer 3:


This open-source CoffeeScript implementation seems to work decently enough: https://github.com/devongovett/grapheme-breaker (if only it weren't CoffeeScript 😜)



Source: https://stackoverflow.com/questions/10287887/get-grapheme-character-count-in-javascript-strings
