Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

佛祖请我去吃肉 2020-12-10 12:19

Splitting a JavaScript string into "characters" is trivial, but the obvious approaches break once you care about Unicode (and you should care about Unicode).
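To make the problem concrete, here is a small sketch. The sample string is an assumption chosen for illustration: it contains U+1F4A9, which lies outside the Basic Multilingual Plane and is therefore stored as two UTF-16 code units.

```javascript
// 'a', then U+1F4A9 written as its surrogate pair, then 'b'.
var sample = 'a\uD83D\uDCA9b';

// split('') cuts between UTF-16 code units, so the surrogate pair is torn apart.
var units = sample.split('');

console.log(sample.length); // counts code units, not code points
console.log(units);         // the two middle entries are lone surrogates
```

The string holds three code points, but the naive split yields four entries, two of which are not valid characters on their own.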

JavaScr

4 answers
  •  北海茫月
    2020-12-10 12:35

    In ECMAScript 6 you'll be able to iterate over a string to get its code points, or you could match the string against /./ug, or you could call codePointAt(i) repeatedly.
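As a rough sketch of those three ES6 routes, assuming an engine that supports string iteration, the u flag, and codePointAt natively. The sample string and variable names are illustrative only.

```javascript
// 'a', U+1F4A9 (as a surrogate pair), 'b'
var s = 'a\uD83D\uDCA9b';

// 1. The string iterator yields one substring per code point.
var viaIterator = Array.from(s, function (ch) { return ch.codePointAt(0); });

// 2. With the u flag, "." matches a whole code point.
//    Note that "." still skips line terminators, so this drops newlines.
var viaRegex = (s.match(/./gu) || []).map(function (ch) { return ch.codePointAt(0); });

// 3. codePointAt(i) in a loop, stepping over the low half of each surrogate pair.
var viaLoop = [];
for (var i = 0; i < s.length; i++) {
    var cp = s.codePointAt(i);
    viaLoop.push(cp);
    if (cp > 0xFFFF) i++; // the code point used two code units
}

console.log(viaIterator, viaRegex, viaLoop);
```

All three produce the same array for this input: 0x61, 0x1F4A9, 0x62.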

    Unfortunately, for...of syntax and regexp flags can't be polyfilled, and calling a polyfilled codePointAt() would be super slow (O(n²)), so we can't realistically use this approach for a while yet.

    So doing it the manual way:

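    String.prototype.toCodePoints= function() {
        var chars = [];
        for (var i= 0; i<this.length; i++) {
            var c1= this.charCodeAt(i);
            // A high surrogate followed by a low surrogate encodes one code point.
            if (c1>=0xD800 && c1<0xDC00 && i+1<this.length) {
                var c2= this.charCodeAt(i+1);
                if (c2>=0xDC00 && c2<0xE000) {
                    chars.push(0x10000 + ((c1-0xD800)<<10) + (c2-0xDC00));
                    i++;
                    continue;
                }
            }
            // BMP code unit, or a lone surrogate passed through unchanged.
            chars.push(c1);
        }
        return chars;
    }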
    

    For the inverse to this see https://stackoverflow.com/a/3759300/18936
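For orientation, a minimal sketch of what such an inverse could look like. This is not the code from the linked answer, and the name fromCodePoints is a hypothetical helper chosen here. It assumes every input value is a valid code point in the range 0 to 0x10FFFF.

```javascript
// Hypothetical helper: turn an array of code points back into a string.
function fromCodePoints(codePoints) {
    var units = [];
    for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp > 0xFFFF) {
            // Supplementary code point: split into a high and a low surrogate.
            cp -= 0x10000;
            units.push(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        } else {
            units.push(cp);
        }
    }
    // Fine for short inputs; very long arrays may exceed the engine's argument limit.
    return String.fromCharCode.apply(null, units);
}

console.log(fromCodePoints([0x61, 0x1F4A9, 0x62]) === 'a\uD83D\uDCA9b');
```

The arithmetic mirrors the decoding step above: subtract 0x10000, put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.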
