astral-plane

Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?

让人想犯罪 · Submitted on 2021-01-27 04:20:35

Question: In MongoDB 2.0.6, when attempting to store or query documents that contain string fields whose values include characters outside the BMP, I get a raft of errors like "Not proper UTF-16: 55357" or "buffer too small". What settings, changes, or recommendations are there to permit storing and querying multilingual strings in Mongo, particularly ones that include these characters above 0xFFFF? Thanks.

Answer 1: There are several issues here: 1) Please be aware that MongoDB …
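
The "Not proper UTF-16: 55357" message suggests a lone high surrogate (decimal 55357 is 0xD83D), i.e. a string whose UTF-16 code units are not correctly paired before being converted to UTF-8. As a minimal sketch, independent of any MongoDB driver API and written in Java for consistency with the other examples on this page, this is the kind of validation that can catch such strings before they reach the database; the class and method names are my own:

```java
// Illustrative check for unpaired UTF-16 surrogates, which show up in errors
// such as "Not proper UTF-16: 55357" (0xD83D, a lone high surrogate).
public class SurrogateCheck {
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the paired low surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("\uD83D\uDE00")); // true: U+1F600 as a proper pair
        System.out.println(isWellFormedUtf16("\uD83D"));       // false: lone high surrogate
    }
}
```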

Regexp in Ruby 1.8.7 that will detect a 4-byte Unicode character

ⅰ亾dé卋堺 · Submitted on 2020-01-05 11:06:33

Question: Can anyone tell me how I would write a Ruby regexp in Ruby 1.8.7 to detect the presence of a 4-byte Unicode character (specifically an emoji)? I am trying to handle the fact that MySQL does not, by default, allow you to store the 4-byte emoji Unicode characters now in use by iOS 5. Thanks!

Answer 1: This appears to match the first two bytes of the four bytes that represent an emoji. This is being run in Ruby 1.8.7: str.match(/\360\237/)

Answer 2: Altering the table might be feasible using a non-blocking …
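
The answer above matches the raw UTF-8 lead bytes \360\237 (0xF0 0x9F), which covers most emoji but works purely at the byte level. For comparison, a sketch of the same detection at the code-point level, written in Java rather than Ruby (Ruby 1.8.7 has no code-point-aware regex of this kind); the \x{...} range syntax assumes Java 7 or later:

```java
import java.util.regex.Pattern;

public class AstralDetect {
    // Any code point outside the BMP (U+10000 and above), which includes all
    // characters that need 4 bytes in UTF-8.
    private static final Pattern ASTRAL = Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    public static void main(String[] args) {
        String withEmoji = "hello \uD83D\uDE00";               // U+1F600 as a surrogate pair
        System.out.println(ASTRAL.matcher(withEmoji).find());  // true
        System.out.println(ASTRAL.matcher("hello").find());    // false
    }
}
```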

Retrieve Unicode code points > U+FFFF from QChar

£可爱£侵袭症+ · Submitted on 2019-12-30 08:30:25

Question: I have an application that is supposed to deal with all kinds of characters and at some point display information about them. I use Qt and its inherent Unicode support in QChar, QString, etc. Now I need the code point of a QChar in order to look up some data in http://unicode.org/Public/UNIDATA/UnicodeData.txt, but QChar's unicode() method only returns a ushort (unsigned short), which is usually a number from 0 to 65535 (0xFFFF). There are characters with code points > 0xFFFF, so how do I …
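
Whichever Qt API is ultimately used, the underlying operation is combining a high/low surrogate pair into one code point. A small Java sketch of that arithmetic, shown only to illustrate the formula; it is not Qt code:

```java
public class CodePointFromSurrogates {
    public static void main(String[] args) {
        // U+1D400 stored as two UTF-16 code units
        char high = '\uD835';
        char low  = '\uDC00';

        int codePoint = Character.toCodePoint(high, low);
        System.out.printf("U+%X%n", codePoint);  // U+1D400

        // The formula the library call implements:
        int manual = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        System.out.println(manual == codePoint); // true
    }
}
```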

Java regex match characters outside Basic Multilingual Plane

我们两清 · Submitted on 2019-12-30 03:46:12

Question: How can I match characters (with the intention of removing them) from outside the Unicode Basic Multilingual Plane in Java?

Answer 1: To remove all non-BMP characters, the following should work: String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");

Answer 2: Are you looking for specific characters or all characters outside the BMP? If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected: String …
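
A runnable version of the first answer's one-liner, padded out with a sample input of my own (the surrogate pair below encodes U+1F0A0):

```java
public class StripNonBmp {
    public static void main(String[] args) {
        String input = "plain text \uD83C\uDCA0 with a playing card"; // U+1F0A0 as a surrogate pair

        // [^\u0000-\uFFFF] matches any code point outside the BMP, because
        // Java regex ranges are evaluated over code points, not UTF-16 units.
        String sanitized = input.replaceAll("[^\u0000-\uFFFF]", "");

        System.out.println(sanitized); // "plain text  with a playing card"
    }
}
```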

Unicode characters from charcode in JavaScript for charcodes > 0xFFFF

懵懂的女人 · Submitted on 2019-12-30 01:00:09

Question: I need to get a string / char from a Unicode charcode and finally put it into a DOM TextNode to add to an HTML page using client-side JavaScript. Currently, I am doing: String.fromCharCode(parseInt(charcode, 16)); where charcode is a hex string containing the charcode, e.g. "1D400". The Unicode character that should be returned is 𝐀, but 퐀 is returned instead! Characters in the 16-bit range (0000 ... FFFF) are returned as expected. Any explanation and/or proposals for correction? Thanks in …
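
The 퐀 that comes back is exactly the low 16 bits of 0x1D400, which is what a 16-bit-only conversion produces. A sketch of the contrast, written in Java for consistency with the other examples here; in JavaScript the analogous fix would be a code-point-aware API such as String.fromCodePoint (available since ES2015):

```java
public class FromCodePoint {
    public static void main(String[] args) {
        int codePoint = Integer.parseInt("1D400", 16); // same hex parsing as in the question

        // Truncating to 16 bits yields U+D400, the wrong character reported above.
        char truncated = (char) codePoint;

        // Character.toChars builds the proper surrogate pair for U+1D400.
        String correct = new String(Character.toChars(codePoint));

        System.out.println(truncated); // U+D400
        System.out.println(correct);   // U+1D400, the intended character
    }
}
```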

How can I display extended Unicode character in a C# console?

若如初见. · Submitted on 2019-12-24 11:47:16

Question: I'm trying to display a set of playing cards, which have Unicode values in the 1F0A0 to 1F0DF range. Whenever I try to use characters whose codes have more than 4 hex digits, I get errors. Is it possible to use these characters in this context? I'm using Visual Studio 2012. char AceOfSpades = '\u1F0A0'; immediately upon typing gives me the error "Too many characters in character literal". This still shows up with either the Unicode or UTF8 encoding. If I try to display '\u1F0A' like above... With …
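
The compiler error is expected: U+1F0A0 does not fit in a single 16-bit character literal, so each card has to be built from its code point (or from a surrogate-pair escape). A sketch of the code-point route in Java, shown for illustration only since the question is about C#; whether the glyphs actually render also depends on the console's encoding and font:

```java
public class PlayingCards {
    public static void main(String[] args) {
        // U+1F0A0..U+1F0DF each need two UTF-16 code units, so they are appended
        // by code point rather than written as single character literals.
        StringBuilder cards = new StringBuilder();
        for (int cp = 0x1F0A0; cp <= 0x1F0DF; cp++) {
            cards.appendCodePoint(cp).append(' ');
        }
        System.out.println(cards);
    }
}
```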

How to escape a character out of Basic Multilingual Plane?

允我心安 · Submitted on 2019-12-24 02:27:35

Question: For characters in the Basic Multilingual Plane, we can use '\uxxxx' to escape them. For example, you can use /[\u4e00-\u9fff]/ to match a common Chinese character (0x4E00–0x9FFF is the range of the CJK Unified Ideographs). But characters outside the Basic Multilingual Plane have codes bigger than 0xFFFF, so you can't use the '\uxxxx' format to escape them, because '\u20000' means the character '\u2000' followed by the character '0', not a character whose code is 0x20000. How can I escape characters outside the Basic …
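
A sketch of the two usual workarounds, a surrogate-pair escape and an explicit code-point escape, written in Java (whose regex engine accepts \x{...} since Java 7); the question may concern a different regex flavor, so treat this purely as an illustration of the idea:

```java
import java.util.regex.Pattern;

public class EscapeAstral {
    public static void main(String[] args) {
        String text = new String(Character.toChars(0x20000)); // U+20000, first CJK Extension B character

        // Option 1: escape the code point directly in the pattern.
        System.out.println(Pattern.compile("\\x{20000}").matcher(text).find()); // true

        // Option 2: spell out the UTF-16 surrogate pair for U+20000.
        System.out.println(Pattern.compile("\uD840\uDC00").matcher(text).find()); // true
    }
}
```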

How to render 32-bit Unicode characters in Google V8 (and Node.js)

与世无争的帅哥 · Submitted on 2019-12-19 05:43:52

Question: Does anyone have an idea how to render Unicode 'astral plane' characters (whose code points are beyond 0xFFFF) in Google V8, the JavaScript VM that drives both Google Chrome and Node.js? Funnily enough, when I give Google Chrome (it identifies as 11.0.696.71, running on Ubuntu 10.4) an HTML page like this: <script>document.write( "helo" ); document.write( "𡥂 ⿸𠂇子" );</script> it will correctly render the 'wide' character 𡥂 alongside the 'narrow' ones, but when I try the equivalent in Node.js …
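
One way to see why the same character can survive in one environment and get mangled in another is to compare its two representations: a surrogate pair in UTF-16 strings and a four-byte sequence in UTF-8 source files. A Java sketch of that round trip (the code point 0x2070E is an arbitrary CJK Extension B character, not necessarily the one from the question):

```java
import java.nio.charset.StandardCharsets;

public class AstralEncoding {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x2070E)); // an arbitrary CJK Extension B character

        // In UTF-16 (what JavaScript strings use) it is a surrogate pair...
        System.out.println(s.length());                                     // 2 code units
        System.out.printf("%X %X%n", (int) s.charAt(0), (int) s.charAt(1)); // D841 DF0E

        // ...and in UTF-8 (how the source file is stored) it is four bytes.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                                    // 4

        // Decoding those four bytes with the wrong charset mangles the character
        // before any rendering happens.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));  // mojibake
    }
}
```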

Java reading in character streams with supplementary Unicode characters

可紊 · Submitted on 2019-12-10 11:04:27

Question: I'm having trouble reading in supplementary Unicode characters using Java. I have a file that potentially contains characters in the supplementary set (anything greater than \uFFFF). When I set up my InputStreamReader to read the file using UTF-8, I would expect the read() method to return a single character for each supplementary character; instead it seems to split on the 16-bit threshold. I saw some other questions about basic Unicode character streams, but nothing seems to deal with the …
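
That split is expected: Reader.read() returns UTF-16 code units, so a supplementary character arrives as two reads that have to be recombined. A minimal sketch of doing that by hand (the sample input is my own; after reading a whole chunk, iterating with String.codePoints() is another option):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadCodePoints {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "A\uD835\uDC00B".getBytes(StandardCharsets.UTF_8); // 'A', U+1D400, 'B'
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int unit;
            while ((unit = reader.read()) != -1) {
                if (Character.isHighSurrogate((char) unit)) {
                    // The second half of the pair comes from the next read() call.
                    int low = reader.read();
                    int cp = Character.toCodePoint((char) unit, (char) low);
                    System.out.printf("code point U+%X%n", cp);   // U+1D400
                } else {
                    System.out.printf("code point U+%X%n", unit); // U+41, U+42
                }
            }
        }
    }
}
```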