Python can't encode with surrogateescape

Posted by 谁说我不能喝 on 2020-01-16 01:36:47

Question


I have a problem encoding Unicode surrogates in Python 3.4:

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

If I'm not mistaken, according to the Python documentation:

'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

The code should just produce the source sequence (b'\xCC'). So why is the exception raised instead?

This is possibly related to my second question:

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.

(From https://docs.python.org/3/library/codecs.html#standard-encodings)

As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what's the reason behind this change?


Answer 1:


This change was made because the Unicode standard explicitly disallows encoding surrogate code points; see issue #12892. It also appears that the surrogateescape error handler cannot be made to work with the UTF-16 or UTF-32 encoders, because these codecs are not ASCII compatible.

Specifically:

I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does not work as expected.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'

=> I expected '[\udc80\udcdc]'.

to which came the response:

Yes, surrogateescape doesn't work with ASCII incompatible encodings and can't. First, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth separated issue (or even PEP) and discussion on Python-Dev.

I believe the surrogateescape handler was meant mainly for UTF-8 data. That decoding from UTF-16 or UTF-32 now works with it too is a nice extra, but apparently it can't work in the other direction.
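To make the contrast concrete, here is a minimal sketch (the helper name `roundtrip` is hypothetical, not from the issue or the docs). It assumes Python 3.4 or later and shows that the round trip works with UTF-8 but fails on the encode step with UTF-16. It also illustrates the second question: a surrogate pair produced by the encoder for a code point above U+FFFF is still fine, while a lone surrogate code point in the string is what gets rejected.

```python
# Sketch only: `roundtrip` is a hypothetical helper for illustration.

def roundtrip(data: bytes, encoding: str) -> bytes:
    """Decode with surrogateescape, then encode back with the same handler."""
    text = data.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape")

# UTF-8 is ASCII compatible, so the undecodable byte survives the round trip.
utf8_result = roundtrip(b"\xcc", "utf-8")  # b'\xcc'

# With UTF-16 the decode step succeeds, but the encode step raises.
try:
    roundtrip(b"\xcc", "utf-16-be")
    utf16_error = None
except UnicodeEncodeError as exc:
    utf16_error = exc

# A surrogate *pair* is how UTF-16 stores a code point above U+FFFF.
# The encoder creates the pair itself, so this is still allowed.
pair_bytes = "\U0001F600".encode("utf-16-be")  # b'\xd8=\xde\x00'

# A lone surrogate *code point* inside the str is what the encoder rejects.
try:
    "\ud83d".encode("utf-16-be")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

So the 3.4 change does not remove surrogate pairs from UTF-16 output; it only stops strings that already contain surrogate code points from being encoded as if they were ordinary characters.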




Answer 2:


If you use surrogatepass (instead of surrogateescape) when encoding, the call should no longer raise on Python 3.

See https://docs.python.org/3/library/codecs.html#codec-base-classes, which says that surrogatepass allows encoding and decoding of surrogate code points (for UTF-related encodings).
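A small sketch of what this looks like in practice, assuming Python 3.4 or later (the variable names are illustrative). One caveat worth checking for your use case: surrogatepass encodes the surrogate code point itself, so the output is not the original byte from the question.

```python
# Decode the stray byte as in the question; this yields the lone surrogate U+DCCC.
escaped = b"\xcc".decode("utf-16-be", "surrogateescape")

# surrogatepass lets the UTF-16 encoder write the surrogate as a normal code unit.
encoded = escaped.encode("utf-16-be", "surrogatepass")  # b'\xdc\xcc'

# Decoding those two bytes with surrogatepass gives the same str back.
decoded = encoded.decode("utf-16-be", "surrogatepass")

# The str round-trips, but the bytes differ from the original input b'\xcc'.
restores_original_byte = (encoded == b"\xcc")  # False
```

In other words, surrogatepass avoids the exception but does not restore the source sequence, so it is not a drop-in replacement for the surrogateescape round trip.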



Source: https://stackoverflow.com/questions/31898353/python-cant-encode-with-surrogateescape
