Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

半世苍凉 submitted on 2019-11-28 10:54:25
John Frazer

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

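to obtain an array of single-code-point strings. Array.from uses the string's iterator, so astral characters (code points above U+FFFF, stored in UTF-16 as surrogate pairs) stay whole instead of being split into two halves.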
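To make the difference concrete, here is a minimal sketch comparing a split by code unit with Array.from. The sample string and the variable names are only illustrative assumptions, not part of the original answer.

```javascript
// Hypothetical sample: "a", U+1F4A9 (an astral character), "b"
const sample = 'a\uD83D\uDCA9b';

// split('') cuts between UTF-16 code units, so the surrogate pair is torn apart
const byCodeUnit = sample.split('');

// Array.from walks the string iterator, which yields whole code points
const byCodePoint = Array.from(sample);

console.log(byCodeUnit.length);                 // 4
console.log(byCodePoint.length);                // 3
console.log(byCodePoint[1] === '\uD83D\uDCA9'); // true
```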

bobince

In ECMAScript 6 you'll be able to iterate over a string to get its code points, or you could match the string against /./ug (note that . skips line terminators, so /[\s\S]/ug is safer), or you could call codePointAt(i) repeatedly.

Unfortunately, for..of syntax and regexp flags can't be polyfilled, and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.
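For environments that do support ES6, the three approaches mentioned above might look roughly like the following sketch. The input string and variable names are assumptions made for illustration; codePointAt is the standard name of the code point accessor on strings.

```javascript
// Hypothetical input: "x", U+1D468 (astral), "y"
const input = 'x\uD835\uDC68y';

// 1. Iterate the string: for..of yields one code point per step
const viaIteration = [];
for (const ch of input) {
    viaIteration.push(ch.codePointAt(0));
}

// 2. Regex with the u flag; [\s\S] is used so line terminators also match
const viaRegex = (input.match(/[\s\S]/ug) || []).map(ch => ch.codePointAt(0));

// 3. Call codePointAt(i) repeatedly, stepping over two code units
//    whenever the code point lies outside the BMP
const viaCodePointAt = [];
for (let i = 0; i < input.length; ) {
    const cp = input.codePointAt(i);
    viaCodePointAt.push(cp);
    i += cp > 0xFFFF ? 2 : 1;
}

console.log(viaIteration.map(cp => cp.toString(16))); // [ '78', '1d468', '79' ]
```

All three arrays should hold the same numeric code points, which is also what the manual toCodePoints approach returns.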

So doing it the manual way:

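String.prototype.toCodePoints= function() {
    var chars= [];  // declared with var so it does not leak as an implicit global
    for (var i= 0; i<this.length; i++) {
        var c1= this.charCodeAt(i);
        // A high (lead) surrogate with at least one code unit after it may start a pair
        if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
            var c2= this.charCodeAt(i+1);
            // A low (trail) surrogate completes the pair: combine both into one code point
            if (c2>=0xDC00 && c2<0xE000) {
                chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
                i++;  // skip the trail surrogate
                continue;
            }
        }
        // BMP character or unpaired surrogate: keep the code unit as-is
        chars.push(c1);
    }
    return chars;
};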

For the inverse to this see https://stackoverflow.com/a/3759300/18936
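As a rough illustration of what such an inverse involves, here is a sketch that turns an array of code points back into a string by re-encoding astral code points as surrogate pairs. The helper name fromCodePoints is hypothetical, and this is not necessarily how the linked answer does it. In ES6 environments the built-in String.fromCodePoint does the same job.

```javascript
// Hypothetical helper (the name is an assumption, not taken from the linked answer)
function fromCodePoints(codePoints) {
    var units = [];
    for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp > 0xFFFF) {
            // Astral code point: split into a lead and a trail surrogate
            var offset = cp - 0x10000;
            units.push(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
        } else {
            // BMP code point: a single code unit
            units.push(cp);
        }
    }
    // Note: apply() can hit engine argument limits on very large arrays
    return String.fromCharCode.apply(null, units);
}

var roundTrip = fromCodePoints([0x41, 0x1D468, 0x42]);
console.log(roundTrip === 'A\uD835\uDC68B'); // true
```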

Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:

const chars = [...text]

e.g., with:

const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]
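As the question title notes, this splits by code point, not by grapheme cluster. A small sketch of what that means in practice, using an illustrative string chosen here as an assumption:

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one visible character
const accented = 'e\u0301';

// The spread operator yields one element per code point, so the base letter
// and the combining mark end up in separate array entries
const parts = [...accented];

console.log(parts.length);          // 2
console.log(parts[0] === 'e');      // true
console.log(parts[1] === '\u0301'); // true
```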