Warning raised by inserting 4-byte unicode to mysql

情到浓时终转凉″ submitted on 2019-11-26 14:44:14

Question


Look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))

The string '\xF0\x9F\x91\x8A' is actually a 4-byte UTF-8 sequence, i.e. the single Unicode character u'\U0001f44a'. MySQL's character set is utf8, but when I insert a 4-byte character it truncates the inserted string. I googled the problem and found that MySQL versions before 5.5.3 don't support 4-byte Unicode characters, and unfortunately mine is 5.5.224. I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte characters in Python. I tried to use a regular expression but failed. Any help?


Answer 1:


If MySQL cannot handle UTF-8 sequences of 4 bytes or more, then you'll have to filter out all Unicode characters at or above code point \U00010000; UTF-8 encodes code points below that threshold in 3 bytes or fewer.
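To see where that threshold comes from, here is a small Python 3 sketch that prints how many bytes UTF-8 needs for code points on either side of U+10000, and decodes the four bytes quoted in the warning. The sample characters and the helper name utf8_length are chosen for illustration and are not part of the question.

```python
# Illustrative only: sample characters picked to straddle the U+10000 threshold.
samples = {
    "latin letter A": "\u0041",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "last BMP code point": "\uffff",
    "first non-BMP code point": "\U00010000",
    "sleepy face": "\U0001f62a",
}


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode the string ch."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# The first four bytes quoted in the MySQL warning decode to one character.
warned = b"\xf0\x9f\x91\x8a".decode("utf-8")
print(f"warning bytes: U+{ord(warned):04X}, {len(warned)} character")
```

Everything up to U+FFFF fits in at most 3 bytes; anything from U+10000 upward needs 4, which is what the old utf8 character set rejects.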

You could use a regular expression for that:

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

Alternatively, you could use the .translate() method with a mapping table that only contains None values:

>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '

However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.

This all presumes you are using a UCS-4 (wide) build of Python 2. If your Python was compiled with UCS-2 support (a narrow build), you can only use code points up to '\U0000ffff' in regular expressions; characters above that are stored as surrogate pairs, so the pattern would have to match surrogate pairs instead.
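On Python 3.3 and later the narrow/wide build distinction no longer exists, so the regular expression approach works unchanged there. A minimal sketch, assuming Python 3; the names strip_non_bmp and _NON_BMP are hypothetical and not part of the original answer:

```python
import re

# Matches any single code point outside the Basic Multilingual Plane,
# i.e. anything UTF-8 would encode in 4 bytes.
_NON_BMP = re.compile("[\U00010000-\U0010ffff]")


def strip_non_bmp(text, replacement=""):
    """Return text with every non-BMP character replaced by replacement."""
    return _NON_BMP.sub(replacement, text)


example = "Some example text with a sleepy face: \U0001f62a"
cleaned = strip_non_bmp(example)
print(repr(cleaned))

# Passing U+FFFD keeps a visible marker where a character was dropped.
marked = strip_non_bmp(example, "\ufffd")
print(repr(marked))
```

Whether to drop the characters or substitute a placeholder is a design choice; a placeholder preserves the fact that something was there, at the cost of storing a character the author never typed.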

I note that as of MySQL 5.5.3 the newly-added utf8mb4 character set does support the full Unicode range.




Answer 2:


I think you should use the utf8mb4 character set instead of utf8 and run

SET NAMES UTF8MB4

after connecting to the DB (link, link, link)
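From Python, that statement could be issued right after connecting, roughly as sketched below. This assumes a DB-API 2.0 style connection object (the kind whose cursor the question's pipeline already calls execute() on) and a server that supports utf8mb4; the helper name use_utf8mb4 is hypothetical. The target column also has to be stored as utf8mb4 for the insert to succeed.

```python
def use_utf8mb4(connection):
    """Switch an open DB-API connection to utf8mb4 via SET NAMES.

    `connection` is assumed to be a DB-API 2.0 connection to a MySQL
    server that supports utf8mb4 (5.5.3 or later).
    """
    cursor = connection.cursor()
    try:
        # Tells the server that client, connection, and results use utf8mb4.
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()
```

Call it once after the connection is opened, before any INSERT that may carry 4-byte characters.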




Answer 3:


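A simple normalization for strings, without regex or translate:

def normalize_unicode(s):
    # Replace every character at or above U+10000 with U+FFFD (REPLACEMENT CHARACTER).
    # The replacement must be a string, not the integer 0xfffd, or join() raises TypeError.
    return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)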


Source: https://stackoverflow.com/questions/10798605/warning-raised-by-inserting-4-byte-unicode-to-mysql
