astral-plane

Warning raised by inserting 4-byte unicode to mysql

て烟熏妆下的殇ゞ submitted on 2019-11-27 09:30:43
Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
  n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually a 4-byte Unicode character: u'\U0001f44a'. MySQL's character set is utf-8, but when a 4-byte Unicode character is inserted, the inserted string gets truncated. I googled the problem and found that MySQL versions under 5.5.3 don't support 4-byte Unicode, and unfortunately mine is 5.5.224. I don't want to upgrade the mysql
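One possible workaround when upgrading is off the table is to replace astral-plane characters before the insert. The following is a minimal Python 3 sketch, not the asker's code: `strip_non_bmp` and the replacement character are assumptions made for illustration, and the sketch also shows which code point the quoted bytes decode to.

```python
import re

# First four bytes quoted in the warning message.
raw = b"\xF0\x9F\x91\x8A"
decoded = raw.decode("utf-8")
code_point = ord(decoded)  # one code point above U+FFFF

# Matches every character outside the Basic Multilingual Plane,
# i.e. everything that needs 4 bytes in UTF-8.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text, replacement="\uFFFD"):
    """Hypothetical helper: replace astral-plane characters in text."""
    return _NON_BMP.sub(replacement, text)


sample = "hi " + decoded + "!"
cleaned = strip_non_bmp(sample)
print(hex(code_point), repr(cleaned))
```

The cleaned string contains only characters that fit in at most 3 bytes of UTF-8, at the cost of losing the original emoji.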

Java charAt used with characters that have two code units

馋奶兔 submitted on 2019-11-27 04:37:16
Question: From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling
  String sentence = "ℤ is the set of integers"; // for clarity; not in book
  char ch = sentence.charAt(1)
doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true.
  String sentence = "ℤ is the set of integers";
  if (sentence.charAt(1) == ' ') System.out.println(
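The observed behaviour can be checked by listing UTF-16 code units directly. This is a rough Python 3 sketch rather than Java; it assumes the ℤ typed in the question is U+2124 (which lies inside the BMP), and it uses U+1D546 purely as an assumed example of a character that really does need two code units. `utf16_units` is a name made up for this illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+2124 DOUBLE-STRUCK CAPITAL Z: inside the BMP, one code unit.
bmp_sentence = "\u2124 is the set of integers"
# U+1D546 (assumed example): outside the BMP, two code units.
astral_sentence = "\U0001D546 is the set of octonions"

bmp_units = utf16_units(bmp_sentence)
astral_units = utf16_units(astral_sentence)

print(hex(bmp_units[1]))                             # unit at index 1
print(hex(astral_units[0]), hex(astral_units[1]))    # surrogate pair
```

Under that assumption, index 1 of the first sentence is the space, which matches what the asker sees; only the second sentence has a low surrogate at index 1.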

How would you get an array of Unicode code points from a .NET String?

不想你离开。 submitted on 2019-11-27 01:50:42
Question: I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16 and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail. I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ... How would you convert a
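The underlying arithmetic is the same in any language. As a language-neutral illustration (Python 3, not the BCL solution being asked for), the sketch below combines surrogate pairs from a sequence of UTF-16 code units into code points; `code_points_from_utf16` is a hypothetical name, and passing unpaired surrogates through unchanged is a simplifying assumption.

```python
def code_points_from_utf16(units):
    """Combine surrogate pairs in a sequence of UTF-16 code units."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # 10 bits from the high surrogate, 10 bits from the low one.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            result.append(unit)
            i += 1
    return result


units = [0x0041, 0xD83D, 0xDC4A, 0x0042]  # "A", one surrogate pair, "B"
points = code_points_from_utf16(units)
print([hex(p) for p in points])
```

With code points in hand, range checks against values above U+FFFF become plain integer comparisons.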

What are the most common non-BMP Unicode characters in actual use? [closed]

若如初见. submitted on 2019-11-26 19:40:51
In your experience, which Unicode characters, code points, or ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be Chinese and Japanese characters used in names but not included in the most widespread CJK multibyte character sets, but on the project I do most work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far. UPDATE: I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found to
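A scan of the kind described can be sketched in a few lines. This is only a guess at the general approach, not the asker's tools: `count_non_bmp` is a hypothetical helper, and the sample string is invented for the example.

```python
from collections import Counter
import unicodedata


def count_non_bmp(text):
    """Count occurrences of each character above U+FFFF in text."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)


# Invented sample: two Gothic letters (U+10330) and one emoji (U+1F44A).
sample = "abc \U00010330\U00010330 \U0001F44A"
counts = count_non_bmp(sample)

for ch, n in counts.most_common():
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: {n}")
```

Run over a real dump, the same counter could be fed line by line with `counts.update(count_non_bmp(line))` to keep memory use flat.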

Mysql server does not support 4-byte encoded utf8 characters

柔情痞子 submitted on 2019-11-26 18:32:46
Question: I've received a server error running a data transfer component from SQL Server to a MySQL db. The error message reads as follows:
  [MySql][ODBC 5.1 Driver][mysqld-5.0.67-community-nt-log]Server does not support 4-byte encoded UTF8 characters.
The source SQL Server table contains nvarchar columns; the target MySQL table contains varchar columns. Can anybody shed some light on this problem?
Answer 1: If you need MySQL to support 4-byte UTF-8 characters (which is normally considered part of UTF-8), you

JavaScript strings outside of the BMP

耗尽温柔 submitted on 2019-11-26 15:22:51
(BMP being Basic Multilingual Plane.) According to JavaScript: The Good Parts: JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide. This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF. Further investigation confirms this:
  > String.fromCharCode(0x20001);
The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001. Question: is it at all possible to handle post
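The truncation described here, and the surrogate pair that UTF-16 would use for the same code point, can be worked through numerically. The sketch below is Python 3 and only models the arithmetic; `low_16_bits` and `to_surrogate_pair` are names invented for the illustration, not JavaScript APIs.

```python
def low_16_bits(code_point):
    """Model keeping only the lowest 16 bits of a code point."""
    return code_point & 0xFFFF


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


truncated = low_16_bits(0x20001)
high, low = to_surrogate_pair(0x20001)
print(hex(truncated), hex(high), hex(low))
```

So a string holding the two code units 0xD840 and 0xDC01 represents U+20001 in UTF-16, whereas the single unit 0x0001 is what survives when only 16 bits are kept.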

Warning raised by inserting 4-byte unicode to mysql

情到浓时终转凉″ submitted on 2019-11-26 14:44:14
Question: Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
  n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually a 4-byte Unicode character: u'\U0001f44a'. MySQL's character set is utf-8, but when a 4-byte Unicode character is inserted, the inserted string gets truncated. I googled the problem and found that mysql under 5.5

What are the most common non-BMP Unicode characters in actual use? [closed]

强颜欢笑 submitted on 2019-11-26 12:16:54
Question: In your experience, which Unicode characters, code points, or ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be Chinese and Japanese characters

JavaScript strings outside of the BMP

♀尐吖头ヾ submitted on 2019-11-26 04:23:05
Question: (BMP being Basic Multilingual Plane.) According to JavaScript: The Good Parts: JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide. This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF. Further investigation confirms this:
  > String.fromCharCode(0x20001);
The fromCharCode method seems to only use the lowest 16 bits when returning the Unicode character. Trying to get U