Ruby String.encode still gives “invalid byte sequence in UTF-8”

说谎 2020-12-28 10:28

In IRB, I'm trying the following:

1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
 => "\xBF" 


        
3 Answers
  •  借酒劲吻你
    2020-12-28 11:22

    I'd guess that "\xBF" already thinks it is encoded in UTF-8, so when you call encode, Ruby sees a request to convert a UTF-8 string to UTF-8 and does nothing:

    >> s = "\xBF"
    => "\xBF"
    >> s.encoding
    => #<Encoding:UTF-8>
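
To check what a string claims to be versus what its bytes actually are, something like the following sketch may help. It builds the byte with `pack` and tags it explicitly (an assumption made here so the result does not depend on the script's source encoding); the variable name `s` mirrors the session above.

```ruby
# Build the single byte 0xBF and tag it as UTF-8, mimicking the literal "\xBF"
# typed into a UTF-8 IRB session.
s = [0xBF].pack("C").force_encoding(Encoding::UTF_8)

puts s.encoding.name    # the encoding the string claims to have
puts s.valid_encoding?  # whether its bytes are well-formed in that encoding
p s.bytes               # the raw bytes, independent of the encoding tag
```

The encoding is only a label on the bytes; `valid_encoding?` is what reports whether the label and the bytes agree.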
    

    \xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three-argument form of encode:

    encode(dst_encoding, src_encoding [, options] ) → str

    [...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.

    You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:

    >> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
    => "�"
    

    Where s is the "\xBF" that thinks it is UTF-8 from above.
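
The three-argument form could be wrapped in a small helper, roughly as sketched below. The method name `to_utf8_lossy` and the `"?"` replacement are hypothetical choices for illustration; `:replace` is a standard `encode` option.

```ruby
# Hypothetical helper: ignore the string's own encoding tag, read its bytes as
# binary, and transcode to UTF-8, substituting anything that has no mapping.
def to_utf8_lossy(str, replacement = "?")
  str.encode(Encoding::UTF_8, Encoding::BINARY,
             invalid: :replace, undef: :replace, replace: replacement)
end

# "a", a stray 0xBF byte, then "b", tagged (wrongly) as UTF-8.
bad = [0x61, 0xBF, 0x62].pack("C*").force_encoding(Encoding::UTF_8)
cleaned = to_utf8_lossy(bad)
puts cleaned
```

Because the source is treated as binary, every byte above 0x7F is replaced, including the bytes of multi-byte characters that were valid UTF-8 to begin with. That makes this approach best suited to input that is expected to be mostly ASCII.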

    You could also use force_encoding on s to force it to be binary and then use the two-argument encode:

    >> s.encoding
    => #<Encoding:UTF-8>
    >> s.force_encoding('binary')
    => "\xBF"
    >> s.encoding
    => #<Encoding:ASCII-8BIT>
    >> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
    => "�"
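
Note that `force_encoding` changes the receiver in place. If the original string needs to keep its tag, one option is to retag a copy instead, as in this sketch (the names `original` and `copy` are illustrative only):

```ruby
# The byte 0xBF tagged as UTF-8, standing in for s from the session above.
original = [0xBF].pack("C").force_encoding(Encoding::UTF_8)

# Retag a duplicate as binary so the original keeps its UTF-8 tag.
copy = original.dup.force_encoding(Encoding::BINARY)

# Two-argument encode now has a real conversion to do (binary to UTF-8),
# so the byte with no mapping is replaced.
foo = copy.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

puts original.encoding.name
puts foo.valid_encoding?
```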
    
