Why does urllib return garbage for some Wikipedia articles?

刺人心 2021-01-14 00:33
>>> import urllib2
>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'
        
3 answers
  •  南方客 (original poster)
    2021-01-14 00:57

    It's not an environment, locale, or encoding problem. The offending stream of bytes is gzip-compressed: the \x1f\x8b at the start is the two-byte signature that begins a gzip stream.
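    As a quick way to confirm this diagnosis, the first two bytes of the response body can be compared against the gzip signature. The sketch below is only illustrative: the helper name looks_like_gzip is hypothetical, and the sample data is compressed locally with gzip.compress (Python 3) rather than fetched from Wikipedia.

```python
import gzip

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b'\x1f\x8b'

def looks_like_gzip(data):
    """Return True if the byte string starts with the gzip signature."""
    return data[:2] == GZIP_MAGIC

# Locally generated sample standing in for a compressed response body.
sample = gzip.compress(b'<html>...</html>')

print(looks_like_gzip(sample))               # True
print(looks_like_gzip(b'<html>...</html>'))  # False
```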

    It looks as though the server is sending gzip-compressed content even though you never asked for it with

    req2.add_header('Accept-encoding', 'gzip')

    You should check result.headers.getheader('Content-Encoding') and, if it reports gzip, decompress the body yourself.
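    A minimal sketch of that check-then-decompress step follows. The helper name maybe_gunzip is hypothetical, and it assumes you have already read the raw body and the Content-Encoding header value from the response. It uses only the standard gzip and io modules, so it does not depend on which urllib version made the request.

```python
import gzip
import io

def maybe_gunzip(body, content_encoding=None):
    """Decompress body if the server declared it as gzip, else return it unchanged.

    body is the raw byte string read from the response;
    content_encoding is the Content-Encoding header value, or None if absent.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding in ('gzip', 'x-gzip'):
        # Wrap the bytes in a file-like object so GzipFile can read them.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

    With the urllib2 response from this answer, the call would look like maybe_gunzip(result.read(), result.headers.getheader('Content-Encoding')). On Python 3, where urllib2 became urllib.request, the header lookup is result.headers.get('Content-Encoding') instead.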
