Question
I have an old Django app that saved UTF-8 strings to the database in a way that makes some of them look like invalid UTF-8 when I fetch them in Ruby.
Before saving, the strings were of type str in Python, but when fetched from the database Django gave me a proper unicode string. When I fetch the same record in Rails, I get a byte sequence identical to Python's str, and Ruby complains that it is an invalid byte sequence.
Example: tested string was a single emoji: 🔥
before save in Django - str type, sequence: [237, 160, 189, 237, 180, 165]
fetched from db in Django - unicode type, sequence: [55357, 56613]
fetched from db in Rails - sequence: [237, 160, 189, 237, 180, 165]
Is there a way to convert that byte sequence in Ruby to a proper utf8 string?
Answer 1:
I can't solve your problem, but I can explain that byte sequence. What you have is UTF-8 encoded UTF-16: each UTF-16 surrogate was encoded as if it were a character of its own (a scheme known as CESU-8).
Both 237, 160, 189 and 237, 180, 165 are 3-byte UTF-8 sequences of the form:
1110xxxx 10xxxxxx 10xxxxxx (the x's are the relevant bits)
... which translate to the code points 55357 and 56613 respectively (0xD83D and 0xDD25 in hex):
[237, 160, 189, 237, 180, 165].map { |b| b.to_s(2) }
#=> ["11101101", "10100000", "10111101", "11101101", "10110100", "10100101"]
#         ^^^^      ^^^^^^      ^^^^^^        ^^^^      ^^^^^^      ^^^^^^
[0b1101_100000_111101, 0b1101_110100_100101]
#=> [55357, 56613]
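Unfortunately, these code points are invalid in UTF-8. That's because they are not characters at all: 0xD83D and 0xDD25 are a UTF-16 surrogate pair (a high and a low surrogate). Decoding the two values as UTF-16 code units gives the intended character: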
[55357, 56613].pack('S>2').encode('utf-8', 'utf-16be')
#=> "🔥"
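Putting the two steps above together, a conversion could be sketched roughly as follows. This is only a minimal sketch: the method name `cesu8_bytes_to_utf8` is made up for illustration, and it assumes the input contains only 1- to 3-byte sequences, as in the example. A genuine 4-byte UTF-8 sequence raises instead of being passed through.

```ruby
# Hypothetical helper (not part of Ruby or Rails): decodes each 1- to 3-byte
# UTF-8 sequence into a 16-bit code unit, then reinterprets the code units
# as UTF-16BE so that surrogate pairs are combined into real characters.
def cesu8_bytes_to_utf8(bytes)
  units = []
  i = 0
  while i < bytes.length
    b = bytes[i]
    if b < 0x80
      # 0xxxxxxx: plain ASCII
      units << b
      i += 1
    elsif (b & 0xE0) == 0xC0
      # 110xxxxx 10xxxxxx
      units << (((b & 0x1F) << 6) | (bytes.fetch(i + 1) & 0x3F))
      i += 2
    elsif (b & 0xF0) == 0xE0
      # 1110xxxx 10xxxxxx 10xxxxxx (this is where the surrogates show up)
      units << (((b & 0x0F) << 12) |
                ((bytes.fetch(i + 1) & 0x3F) << 6) |
                (bytes.fetch(i + 2) & 0x3F))
      i += 3
    else
      # Assumption: anything else (e.g. a real 4-byte sequence) is unexpected.
      raise ArgumentError, "unexpected lead byte #{b} at index #{i}"
    end
  end
  # Pack as big-endian 16-bit values and transcode UTF-16BE -> UTF-8.
  units.pack('S>*').encode('UTF-8', 'UTF-16BE')
end

cesu8_bytes_to_utf8([237, 160, 189, 237, 180, 165])
#=> "🔥"
```

For a string fetched in Rails, passing `string.bytes` should give the same array as in the question. A truncated sequence raises IndexError via `fetch`, and an unpaired surrogate makes `encode` raise rather than produce a broken string.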
Source: https://stackoverflow.com/questions/49145161/get-a-unicode-from-pythons-str-byte-sequence