Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

Submitted by 折月煮酒 on 2019-12-28 16:45:33

Question


Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16) but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively.
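To illustrate the problem, here is a minimal sketch (the variable names are only for illustration) of what the trivial split does to a character outside the BMP:

```javascript
// 'A' is in the BMP; U+1D468 (𝑨) is outside it and needs two code units.
const sample = 'A\uD835\uDC68';

// The trivial split works on code units, so it cuts the surrogate pair in half.
const codeUnits = sample.split('');

console.log(sample.length);             // 3 code units for 2 code points
console.log(codeUnits[1] === '\uD835'); // true: a lone high surrogate
console.log(codeUnits[2] === '\uDC68'); // true: a lone low surrogate
```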

I'm looking for a way to split a JavaScript string by codepoint, whether a codepoint requires one or two JavaScript "characters" (code units).

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.

For the purposes of this question I do not require splitting by grapheme cluster.
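For reference, the difference between a codepoint and a grapheme cluster shows up even inside the BMP. A small sketch, using a decomposed "é" as an assumed example:

```javascript
// "é" written as a base letter followed by a combining acute accent (U+0301).
const decomposed = 'e\u0301';

// Both codepoints are in the BMP, so each one is a single code unit.
const parts = decomposed.split('');

console.log(parts.length);          // 2 codepoints, although it displays as one character
console.log(parts[1] === '\u0301'); // true: the accent is a separate codepoint
```

A codepoint split keeps the accent as its own element, which is acceptable here.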


Answer 1:


@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings that respects astral characters (those outside the BMP, which are stored as surrogate pairs).
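If you want numeric code points rather than strings, Array.from also accepts a mapping function as its second argument. A short sketch, with an assumed sample string:

```javascript
const text = 'A\uD835\uDC68B';

// One string per code point; the surrogate pair stays together.
const chars = Array.from(text);

// Map each single-code-point string to its numeric value.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(chars.length);                // 3
console.log(codePoints[1].toString(16));  // "1d468"
```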




Answer 2:


In ECMAScript 6 you'll be able to iterate over a string to get code points, or you could search a string for /./ug, or you could call codePointAt(i) repeatedly.
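In an engine that supports these features, the three options might look roughly like the sketch below. It uses the standardized method name codePointAt; the variable names are made up for illustration, and the regexp uses [\s\S] rather than . so that line breaks are not skipped.

```javascript
const text = 'A\uD835\uDC68B';

// 1. Iterate the string: each step yields one whole code point as a string.
const viaIterator = [];
for (const ch of text) {
    viaIterator.push(ch.codePointAt(0));
}

// 2. Match with the u flag so that a surrogate pair counts as one character.
const viaRegExp = (text.match(/[\s\S]/gu) || []).map((ch) => ch.codePointAt(0));

// 3. Call codePointAt(i) repeatedly; i is a code unit index, so step over
//    the low surrogate after reading an astral code point.
const viaCodePointAt = [];
for (let i = 0; i < text.length; i++) {
    const cp = text.codePointAt(i);
    viaCodePointAt.push(cp);
    if (cp > 0xFFFF) i++;
}
```

All three arrays should hold the same code points for this input.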

Unfortunately for...of syntax and regexp flags can't be polyfilled, and a polyfilled codePointAt() has to do the same surrogate checks as the code below on every call, so in engines without ES6 support there is nothing to gain from this approach yet.

So doing it the manual way:

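String.prototype.toCodePoints = function() {
    var chars = []; // numeric code points, in order
    for (var i = 0; i < this.length; i++) {
        var c1 = this.charCodeAt(i);
        // High surrogate with at least one code unit after it?
        if (c1 >= 0xD800 && c1 < 0xDC00 && i + 1 < this.length) {
            var c2 = this.charCodeAt(i + 1);
            // Followed by a low surrogate: combine the pair into one code point.
            if (c2 >= 0xDC00 && c2 < 0xE000) {
                chars.push(0x10000 + ((c1 - 0xD800) << 10) + (c2 - 0xDC00));
                i++; // skip the low surrogate
                continue;
            }
        }
        // BMP character or unpaired surrogate: keep the code unit as is.
        chars.push(c1);
    }
    return chars;
};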

For the inverse to this see https://stackoverflow.com/a/3759300/18936
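As a rough idea of what the inverse involves (a sketch only, not necessarily the code from the linked answer; fromCodePoints is a hypothetical name), each code point above 0xFFFF is split back into a surrogate pair:

```javascript
// Hypothetical helper: turns an array of numeric code points into a string.
function fromCodePoints(codePoints) {
    var out = '';
    for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp > 0xFFFF) {
            // Astral code point: emit a high and a low surrogate.
            var offset = cp - 0x10000;
            out += String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
        } else {
            // BMP code point: a single code unit is enough.
            out += String.fromCharCode(cp);
        }
    }
    return out;
}

console.log(fromCodePoints([0x41, 0x1D468, 0x42]) === 'A\uD835\uDC68B'); // true
```

In ES6 engines the built-in String.fromCodePoint does the same job.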




Answer 3:


Along the lines of @John Frazer's answer, one can use this even more succinct form of string iteration:

const chars = [...text]

e.g., with:

const text = 'A\uD835\uDC68B\uD835\uDC69C\uD835\uDC6A'
const chars = [...text] // ["A", "𝑨", "B", "𝑩", "C", "𝑪"]


Source: https://stackoverflow.com/questions/21397316/split-javascript-string-into-array-of-codepoints-taking-into-account-surrogat
