Unicode string with diacritics split by chars

轻奢々 2020-12-06 01:49

I have this Unicode string: Ааа́Ббб́Ввв́ГгҐґДд

And I want to split it by chars. Right now, if I try to loop through all the chars, I get something like this:

5 answers

臣服心动 (original poster)

2020-12-06 02:26
A little update on this.

Since ES6 arrived, there are new string methods and new ways of dealing with strings. They offer ways to tackle the two problems involved here.

1) Emoji and surrogate pairs

Emoji and other Unicode characters that fall outside the Basic Multilingual Plane (BMP, the Unicode "code points" in the range 0x0000 - 0xFFFF) are stored as surrogate pairs, that is, two UTF-16 code units each. Strings in ES6 adhere to the iterator protocol and iterate by code point rather than by code unit, so you can do this:
```
let textWithEmoji = '\ud83d\udc0e\ud83d\udc71\u2764'; // horse, person with blond hair, and heart
[...textWithEmoji].length; // 3
for (const char of textWithEmoji) { console.log(char); } // logs 3 chars
```
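To make the difference concrete, here is a small sketch that contrasts splitting by UTF-16 code units with splitting by code points. The variable names (`codeUnits`, `codePoints`, `hex`) are only illustrative.

```javascript
const textWithEmoji = '\ud83d\udc0e\ud83d\udc71\u2764';

// split('') cuts at UTF-16 code units, so each surrogate pair is torn in half.
const codeUnits = textWithEmoji.split('');
console.log(codeUnits.length); // 5

// Array.from (like the spread operator) uses the string iterator,
// which walks the string one code point at a time.
const codePoints = Array.from(textWithEmoji);
console.log(codePoints.length); // 3

// codePointAt(0) returns the full code point, not just the high surrogate.
const hex = codePoints.map((ch) => ch.codePointAt(0).toString(16));
console.log(hex); // [ '1f40e', '1f471', '2764' ]
```

Note that this only fixes surrogate pairs. A base letter followed by a combining mark is still two code points, which is the second problem.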
2) Diacritics

This is a harder problem to solve, because you start to work with "grapheme clusters" (a character and its diacritics). ES6 has a method that simplifies working with them, but it is still hard to get right. The String.prototype.normalize method eases the work, but as Mathias Bynens puts it:

(A) code points with multiple combining marks applied to them always result in a single visual glyph, but may not have a normalized form, in which case normalization doesn’t help.
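As a rough illustration of that limitation, the sketch below first shows a case where `normalize('NFC')` helps and one where it does not, and then splits a string into grapheme clusters. It assumes a runtime that provides `Intl.Segmenter` (not part of ES6 itself); the helper name `splitGraphemes` and the regex fallback are my own additions, and the fallback is only an approximation that groups a base code point with the combining marks that follow it.

```javascript
// Latin e + combining acute has a precomposed form, so NFC merges it.
console.log('e\u0301'.normalize('NFC').length); // 1

// Cyrillic а + combining acute has no precomposed form, so NFC leaves two code points.
console.log('а\u0301'.normalize('NFC').length); // 2

// Hypothetical helper: split a string into grapheme clusters.
function splitGraphemes(text) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  // Approximate fallback: one non-mark code point plus any combining marks after it.
  // It ignores a leading stray mark and does not handle emoji joiner sequences.
  return text.match(/\P{M}\p{M}*/gu) || [];
}

// A shortened version of the string from the question.
const sample = 'Аа' + 'а\u0301' + 'Бб' + 'б\u0301';
console.log([...sample].length);            // 8 code points
console.log(splitGraphemes(sample).length); // 6 visible characters
```

Splitting by grapheme cluster rather than normalizing first is the safer choice here, since it does not depend on a precomposed form existing for each letter and mark combination.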

More insight can be found here:

https://ponyfoo.com/articles/es6-strings-and-unicode-in-depth
https://mathiasbynens.be/notes/javascript-unicode