Python: Chinese text decoded into special characters

Question

I use a scraper/curl request to fetch HTML from another site that contains Chinese text, but some of the resulting text looks wrong. It shows up like this:

°¢Àï°Í°ÍÎªÄúÌá¹©ÁË×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅÆµç×Ó±í ÖÇÄÜÊ±ÉÐ³±Á÷Å®Ê¿ÊÖ»·ÊÖÁ´Ê×ÊÎ±í´øµÈ²úÆ·£¬ÕâÀïÔÆ¼¯ÁËÖÚ¶àµÄ¹©Ó¦ÉÌ£¬²É¹ºÉÌ£¬ÖÆÔìÉÌ¡£ÓûÁË½â¸ü¶à×ÔÁµÕß¹¤³§Ö±ÏúÆ·ÅÆµç×Ó±í ÖÇÄÜÊ±ÉÐ³±Á÷Å®Ê¿ÊÖ»·ÊÖÁ´Ê×ÊÎ±í´øÐÅÏ¢£¬Çë·ÃÎÊ°¢Àï°Í°ÍÅú·¢Íø£¡

It should be Chinese text. This is my code:

str(result.decode('ISO-8859-1'))

Without the 'ISO-8859-1' decode (returning the result variable as-is), it displays question marks like this:

��Ͱ�Ϊ��ṩ��߹��ֱ��Ʒ�Ƶ��ӱ� ��ʱ�г��Ůʿ�ֻ��α��Ȳ�Ʒ��Ƽ��ڶ�Ĺ�Ӧ�̣��ɹ��̣��̡��˽��߹��ֱ��Ʒ�Ƶ��ӱ� ��ʱ�г��Ůʿ�ֻ��α��Ϣ��ʰ��Ͱ��

Could you tell me which encoding I should use to encode/decode this?

Thanks

Answer 1:

Chinese text can be stored in one of several charsets.
Three common Chinese charsets are gb2312, big5, and gbk.
Here is a snippet that converts a file from gb2312 to utf-8.

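import codecs

# Read the GB2312-encoded file; codecs.open decodes it to Unicode text.
infile = codecs.open("in.txt", "r", "gb2312")
lines = infile.readlines()  # readlines() returns every line; readline() would return only the first
infile.close()

print(lines)

# Write the same text back out, encoded as UTF-8.
outfile = codecs.open("out.txt", "w", "utf-8")
outfile.writelines(lines)
outfile.close()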
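
When the charset is not known in advance, one rough way to narrow it down is to try each candidate with strict decoding and see which results are readable. The sketch below is only an illustration: `try_decodings` and the sample bytes are hypothetical, not part of the original answer.

```python
def try_decodings(data, candidates=("gb2312", "gbk", "big5")):
    """Return {codec name: decoded text, or None if strict decoding fails}."""
    results = {}
    for name in candidates:
        try:
            # Strict mode (the default) raises on bytes that are invalid in this codec.
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Stand-in for bytes fetched from the site (assumed to be GBK here).
sample = "阿里巴巴".encode("gbk")
for name, text in try_decodings(sample).items():
    print(name, repr(text))
```

A successful decode does not prove the codec is the right one, since a wrong codec can still produce valid but meaningless characters, so the output needs to be checked by eye.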

Answer 2:

The solution turned out to be really simple. As mentioned by @Thu Yein tun, I checked the Content-Type header of the HTTP response, which showed text/html;charset=GBK, so I changed my code to decode with that charset:

result.decode('gbk')
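
Rather than hard-coding the charset, the value can be read from the Content-Type header. This is a minimal sketch using only the standard library; the helper names `charset_from_content_type` and `decode_body`, the UTF-8 fallback, and the sample bytes are assumptions for illustration, not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode raw response bytes using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type), errors="replace")


# Stand-in for the raw bytes held in `result` in the question.
body = "阿里巴巴".encode("gbk")
print(decode_body(body, "text/html;charset=GBK"))
```

The fallback to UTF-8 is a guess for responses that omit the charset; a page may also declare its charset in a `<meta>` tag, which this sketch does not inspect.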

Answer 3:

Try this block of code.

You can encode the content to bytes with the latin1 codec, percent-unquote it, and then decode the result as UTF-8.

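#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Python 3: unquote_to_bytes handles byte input (Python 2 used urllib2/urllib unquote).
from urllib.parse import unquote_to_bytes

# Encoding with latin1 maps each character back to the single byte it stands for.
# \xad is a soft hyphen, written as an escape because it is invisible and easily lost when copying.
bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')
unquoted = unquote_to_bytes(bytesquoted)  # resolve the %XX escapes into raw bytes
print(unquoted.decode('utf8'))  # the raw bytes are UTF-8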

Output:

台南 親子餐廳
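
The same latin1 round trip can be applied to the garbled text in the question, which is GBK data that was decoded as ISO-8859-1. A minimal sketch, assuming the garbled string survived copying intact; `repair_mojibake` is a hypothetical helper name.

```python
def repair_mojibake(text, wrong="iso-8859-1", right="gbk"):
    """Undo a decode done with the wrong codec: re-encode, then decode correctly."""
    # ISO-8859-1 maps every byte to exactly one character, so encoding with it
    # recovers the original bytes without loss.
    return text.encode(wrong).decode(right)


garbled = "°¢Àï°Í°Í"  # first characters of the garbled output in the question
print(repair_mojibake(garbled))  # prints 阿里巴巴
```

This only works when no characters were dropped along the way. Text pasted from a browser may have lost invisible control characters, so decoding the original bytes with 'gbk' directly is the more reliable fix.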

Source: https://stackoverflow.com/questions/53954604/python-encoding-chinese-to-special-character

Tags

python

decode

encode