Weird leading characters utf-8/utf-16 encoding in Python

两盒软妹~` 提交于 2020-07-03 11:53:27

问题


I have written a simplified version to demonstrate the problem. I am encoding special characters in utf-8 and UTF-16 format.

With utf-8 encoding there is no problem, when I am encoding with UTF-16 I get some weird leading characters.

I tried to remove all trailing and leading characters but still the error persists.

Sample of code:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
try:
    s.strip()
    u = unicode(s, pattern)
    print chardet.detect(u.encode(pattern, 'strict'))
    return u.encode(pattern, 'strict')
except UnicodeDecodeError as err:
    return "UnicodeDecodeError: ", err
except Exception as err:
    return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 1.0, 'language': '', 'encoding': 'UTF-16'}
��Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§

Where I am going wrong I can not figure it out. I do not want to convert the UTF-16 back to utf-8 it is important for me to keep the format on UTF-16.

Update: Thanks to @tripleee the solution to my problem is to define encoding UTF-16le or UTF-16be. Thanks again for your time and effort.

Thanks in advance for everyone time and effort.


回答1:


Answer to the problem was given by @tripleee.

By defining utf-16le or utf-16be instead of utf-16 resolved the problem.

Sample of solution:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16be')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§


来源:https://stackoverflow.com/questions/47713964/weird-leading-characters-utf-8-utf-16-encoding-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!