Python 3.4 hex to Japanese Characters

£可爱£侵袭症+ 提交于 2019-12-04 14:53:54

You are confusing the Python representation with the contents. You are shown \xhh hex escapes used in Python string literals to keep the displayed value ASCII safe and reproducable.

You have UTF-8 data here:

>>> name = b"\xe5\xb9\xb4\xe3\x81\xab\xe4\xb8\x80\xe5\xba\xa6\xe3\x81\xae\xe6\x99\xb4\xe3\x82\x8c\xe5\xa7\xbf"
>>> name.decode('utf8')
'\u5e74\u306b\u4e00\u5ea6\u306e\u6674\u308c\u59ff'
>>> print(name.decode('utf8'))
年に一度の晴れ姿

Note that I used a bytes() string literal there, using b'...'. If your data is not a bytes object you have a Mojibake and need to encode to bytes first:

name.encode('latin1').decode('utf8')

Latin 1 maps codepoints one-on-one to bytes, so that's usually a safe bet to use in case of such data. It could be that you have a Mojibake in a different codec, it depends on how you retrieved the data.

If used open() to read the data from a file, you either specified the wrong encoding or relied on your platform default. use open(filename, encoding='utf8') to remedy that.

If you used the requests library to load this from a website, take into account that the response.text attribute uses latin-1 as the default codec if a) the site didn't specify a codec and b) the response has a text/* mime-type. If this is sourced from HTML, usually the codec is part of the HTML headers instead. Use a library like BeautifulSoup to handle HTML (using the response.content raw bytes) and it'll detect such information for you.

If all else fails, the ftfy library may still be able to fix a Mojibake; it uses specially constructed codecs to reverse common errors.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!