Get a unicode from python's str byte sequence

允我心安 submitted on 2019-12-23 01:02:06

Question


I have an old Django app that saved UTF-8 strings to the database in a way that makes some of them look like invalid UTF-8 when I fetch them in Ruby.

Before saving, the strings were of type str in Python, but when fetched from the database, Django gave me a proper unicode string. When I fetch the same record in Rails, I get a byte sequence identical to Python's str, and Ruby complains that it is an invalid byte sequence.

Example: the tested string was a single emoji: 🔥

  • before save in Django - str type, sequence: [237, 160, 189, 237, 180, 165]

  • fetched from db in Django - unicode type, sequence [55357, 56613]

  • fetched from db in Rails - sequence [237, 160, 189, 237, 180, 165]

Is there a way to convert that byte sequence in Ruby to a proper UTF-8 string?


Answer 1:


I can't solve your problem, but I can explain that byte sequence. What you have is UTF-8 encoded UTF-16: each UTF-16 code unit was encoded separately as UTF-8 (a scheme known as CESU-8).

Both 237, 160, 189 and 237, 180, 165 are 3-byte UTF-8 sequences:

  • 1110xxxx 10xxxxxx 10xxxxxx (the x's are the relevant bits)

... which translate to the codepoints 55357 and 56613 respectively (0xD83D and 0xDD25 in hex):

[237, 160, 189, 237, 180, 165].map { |b| b.to_s(2) }
#=> ["11101101", "10100000", "10111101", "11101101", "10110100", "10100101"]
#         ^^^^      ^^^^^^      ^^^^^^        ^^^^      ^^^^^^      ^^^^^^

[0b1101_100000_111101, 0b1101_110100_100101]
#=> [55357, 56613]
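
The same extraction can be done with bit masks instead of by hand. This is only a minimal sketch: `decode_three_byte` is a hypothetical helper name, and it assumes its input really is a well-formed 3-byte sequence of the form shown above.

```ruby
# Hypothetical helper: decode one 3-byte UTF-8 sequence
# (1110xxxx 10xxxxxx 10xxxxxx) into its 16-bit value.
# Assumes the three bytes are already known to follow that pattern.
def decode_three_byte(b1, b2, b3)
  ((b1 & 0x0F) << 12) |  # low 4 bits of the lead byte
    ((b2 & 0x3F) << 6) | # low 6 bits of the first continuation byte
    (b3 & 0x3F)          # low 6 bits of the second continuation byte
end

bytes = [237, 160, 189, 237, 180, 165]
code_units = bytes.each_slice(3).map { |triple| decode_three_byte(*triple) }
#=> [55357, 56613]
```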

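Unfortunately, these codepoints are invalid in UTF-8. That's because they are actually UTF-16 surrogates, a high surrogate (0xD83D) followed by a low surrogate (0xDD25), which only make sense as a pair of UTF-16 code units: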

[55357, 56613].pack('S>2').encode('utf-8', 'utf-16be')
#=> "🔥"
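
Building on that, one possible way to repair a whole string in Ruby is to look for each 6-byte pair of UTF-8 encoded surrogates and replace it with the proper 4-byte UTF-8 encoding of the combined codepoint. The sketch below is untested against a real database: `repair_surrogates` is a hypothetical helper (not part of Ruby or Rails), and it assumes every other byte in the string is already valid UTF-8, which it copies through unchanged.

```ruby
# Hypothetical helper: rewrite UTF-8 encoded surrogate pairs as real UTF-8.
# Assumption: all bytes outside such pairs are already valid UTF-8.
def repair_surrogates(str)
  bytes = str.bytes
  out = []
  i = 0
  while i < bytes.length
    b = bytes[i, 6]
    # High surrogate: ED A0..AF 80..BF, low surrogate: ED B0..BF 80..BF
    if b.length == 6 &&
       b[0] == 0xED && (0xA0..0xAF).cover?(b[1]) && (0x80..0xBF).cover?(b[2]) &&
       b[3] == 0xED && (0xB0..0xBF).cover?(b[4]) && (0x80..0xBF).cover?(b[5])
      high = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
      low  = ((b[3] & 0x0F) << 12) | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
      # Standard UTF-16 surrogate pair arithmetic
      code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
      out.concat([code_point].pack('U').bytes)
      i += 6
    else
      out << b[0] # copy any other byte through untouched
      i += 1
    end
  end
  out.pack('C*').force_encoding(Encoding::UTF_8)
end

broken = [237, 160, 189, 237, 180, 165].pack('C*').force_encoding(Encoding::UTF_8)
broken.valid_encoding? #=> false

fixed = repair_surrogates(broken)
fixed.valid_encoding?  #=> true
fixed                  #=> "🔥"
```

Working on raw bytes rather than on characters sidesteps Ruby's invalid byte sequence errors, since the input is never interpreted as UTF-8 until the repaired bytes are packed back into a string.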


Source: https://stackoverflow.com/questions/49145161/get-a-unicode-from-pythons-str-byte-sequence
