How to escape a character out of Basic Multilingual Plane?

Submitted by 允我心安 on 2019-12-24 02:27:35

Question


For characters in the Basic Multilingual Plane, we can use the '\uxxxx' escape. For example, you can use /[\u4e00-\u9fff]/ to match a common Chinese character (0x4e00-0x9fff is the range of CJK Unified Ideographs).

But characters outside the Basic Multilingual Plane have code points greater than 0xffff. You can't use the '\uxxxx' format to escape them, because '\u20000' means the character '\u2000' followed by the character '0', not the character whose code point is 0x20000.
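The parsing behaviour described above can be checked directly. This is only a minimal sketch; the variable name `s` is chosen for illustration.

```javascript
// '\u20000' is read as the escape \u2000 followed by a literal '0'.
const s = '\u20000';

console.log(s.length);                      // 2
console.log(s.charCodeAt(0).toString(16));  // "2000"
console.log(s[1]);                          // "0"
```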

How can I escape characters outside the Basic Multilingual Plane? Using those characters directly is not a good idea, because most fonts can't display them.


Answer 1:


You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).
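If you would rather generate the notation yourself than look it up, something along these lines should work. It is a rough sketch, not the code behind either tool mentioned above, and the helper name `toUnicodeEscapes` is made up for this example. It walks the UTF-16 code units of a string and escapes everything outside ASCII, so a non-BMP character comes out as two \uXXXX escapes.

```javascript
// Hypothetical helper: turn every non-ASCII UTF-16 code unit of a string
// into a \uXXXX escape. Non-BMP characters therefore become surrogate pairs.
function toUnicodeEscapes(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      out += str[i]; // keep plain ASCII readable
    } else {
      out += '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
    }
  }
  return out;
}

console.log(toUnicodeEscapes(String.fromCodePoint(0x20000))); // \uD840\uDC00
```

The output can be pasted into a string literal or a regular expression as-is.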

Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code, whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats, like properly declaring the character encoding.

Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '𠀀' is also supported in SimSun-ExtB.




Answer 2:


Characters outside the BMP are not recognized directly by JavaScript -- they're represented internally as UTF-16 surrogate pairs. For instance, the character you mentioned, U+20000 (currently allocated to "CJK Unified Ideographs Ext. B") is represented as the surrogate pair U+D840 U+DC00. As a JavaScript string, this would simply be "\uD840\uDC00". (Note that s.length is 2 for this string, even though it displays as a single character.)
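For reference, the pair can be computed by hand using the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A small sketch follows; the function name `toSurrogatePair` is an assumption for illustration, not a built-in.

```javascript
// Hypothetical helper implementing the UTF-16 surrogate-pair arithmetic.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20000);
console.log(high.toString(16), low.toString(16));   // d840 dc00
console.log(String.fromCharCode(high, low).length); // 2
```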

Wikipedia has details on the encoding scheme used.




Answer 3:


Interesting problem.

Now that we have ES6, we can do this:

let newSpeak = '\u{1F4A9}'

Note that internally it's still UTF-16 with surrogate pairs:

newSpeak.length === 2 // "wrong"
[...newSpeak].length === 1
newSpeak === '\uD83D\uDCA9'

Unicode is huge.

Also, it's not just the literals:

newSpeak.charCodeAt(0) === 0xD83D // "wrong"
newSpeak.codePointAt(0) === 0x1F4A9

String.fromCharCode(0x1F4A9) !== newSpeak
String.fromCodePoint(0x1F4A9) === newSpeak

for (let i = 0; i < newSpeak.length; i++) console.log(newSpeak[i]) // "wrong"
for (let c of newSpeak) console.log(c)

[...'🏃🚚'].map(c => `__${c}`).join('') === "__🏃__🚚"
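Since the question was about regular expressions: the same \u{...} notation should also work inside a pattern, provided the ES6 `u` flag is set. A sketch, assuming the CJK Unified Ideographs Extension B block runs from U+20000 to U+2A6DF; the names `extB` and `extBLegacy` are just for this example.

```javascript
// With the u flag, \u{...} escapes and astral ranges work in patterns.
const extB = /[\u{20000}-\u{2A6DF}]/u;

console.log(extB.test('\u{20000}')); // true
console.log(extB.test('\u4e00'));    // false (BMP ideograph, outside the range)

// Without the u flag, the same range has to be spelled out as surrogate pairs.
const extBLegacy = /[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF]/;

console.log(extBLegacy.test('\uD840\uDC00')); // true
```

The `u` version is much easier to read, so the surrogate-pair form is probably only worth it when pre-ES6 engines have to be supported.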

I � handling Unicode.



Source: https://stackoverflow.com/questions/13204412/how-to-escape-a-character-out-of-basic-multilingual-plane
