urllib2 opener providing wrong charset

既然无缘 2020-12-11 07:21

When I open the URL and read the response, I can't make sense of the content. But when I check the content header, it says it is encoded as utf-8. So I tried to convert it to unicode and it compla

2 answers
  • 2020-12-11 07:36

    This is a common mistake. The server is sending a gzip-compressed stream, so the raw bytes are not UTF-8 text yet.

    You should unpack it first:

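    import gzip
    import StringIO  # Python 2 module, matching urllib2

    response = opener.open(self.__url, data)
    # Decompress only when the server declares a gzip body.
    if response.info().get('Content-Encoding') == 'gzip':
        buf = StringIO.StringIO(response.read())
        gzip_f = gzip.GzipFile(fileobj=buf)
        content = gzip_f.read()
    else:
        content = response.read()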
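
After decompression the bytes still need to be decoded to get unicode text, which is what the question was originally after. Here is a minimal sketch of both steps together, assuming Python 3 rather than the urllib2 setup in the question. The helper name `body_to_text` and the in-memory sample are hypothetical and stand in for a real server response.

```python
import gzip
import io


def body_to_text(raw, content_encoding=None, charset="utf-8"):
    """Gunzip raw bytes if the header value says gzip, then decode to text.

    Hypothetical helper for illustration; not part of urllib or urllib2.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(charset)


# Build a gzipped UTF-8 payload in memory to stand in for a response body.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(u"h\u00e9llo".encode("utf-8"))
sample = buf.getvalue()

# Decompress first, decode second.
text = body_to_text(sample, "gzip")

# A body without a Content-Encoding header is only decoded.
plain = body_to_text(b"plain")
```

With a real Python 3 `urllib.request` response, the arguments would presumably come from `response.read()`, `response.headers.get("Content-Encoding")` and `response.headers.get_content_charset()`, falling back to utf-8 when no charset is declared.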
    
  • 2020-12-11 07:47

    The header is probably wrong. Check out chardet.

    EDIT: Thinking more about it, my money is on the contents being gzipped. I believe some of Python's URL-opening modules and classes decompress gzip automatically, while others don't.
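
One way to test the gzip guess without trusting any header is to look at the first two bytes of the body: every gzip stream starts with the bytes 0x1f 0x8b. This is a rough diagnostic sketch, assuming Python 3. The name `looks_gzipped` is hypothetical, and a match only suggests gzip, it does not prove it.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(raw):
    """Return True if the byte string starts with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


# Build a small gzipped sample in memory to stand in for a response body.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"<html></html>")
compressed_sample = buf.getvalue()

is_gz = looks_gzipped(compressed_sample)
is_plain_gz = looks_gzipped(b"<html></html>")
```

If the check comes back true for the bytes returned by `read()`, the opener in use did not decompress the body, and it has to be gunzipped before any charset decoding.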
