Get html using Python requests?

Asked by 情歌与酒 on 2020-12-03 09:47

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:

    >>> import requests
    >>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
    >>> r.text

Instead of the page's html, all I get back is what looks like binary garbage. What is going on?
2 Answers
  • 2020-12-03 10:34

    The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

    $ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
    HTTP/1.1 200 OK
    Date: Tue, 06 Jan 2015 17:46:49 GMT
    Server: Apache
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
    Vary: Accept-Encoding
    Content-Encoding: gzip
    Content-Length: 3659
    Content-Type: text/html
    

    The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
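
    What that failure mode looks like, roughly (a hypothetical sketch, not the actual WRCCWrappers.py source): a CGI script is supposed to print its header lines first, then a blank line, then the body. Print the doctype and a blank line straight away and Apache parses the doctype as a header line, then splices its own headers in after it.

    #!/usr/bin/env python
    # Hypothetical CGI script reproducing the bug described above.
    import sys

    # No real header lines are printed first, so Apache reads this as a
    # (malformed) header line...
    sys.stdout.write('<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
                     '"DTD/xhtml1-transitional.dtd">'
                     '<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">\n')
    # ...and this blank line as the end of the headers; Apache then inserts
    # its own headers (Vary, Content-Encoding, Content-Length, ...) here.
    sys.stdout.write('\n')
    sys.stdout.write('...rest of the page...\n')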

    As such, requests doesn't detect that the data is gzip-encoded either. The data is all there, you just have to decode it yourself. Or you could, if the data weren't rather incomplete.
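
    For completeness, decoding the compressed body by hand would look roughly like this (a minimal sketch using only the standard library; since the stream here is truncated, the read() call may well raise an IOError):

    import gzip
    import io

    import requests

    url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
           '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
    r = requests.get(url)

    # Because the Content-Encoding header is lost, requests leaves the raw
    # compressed bytes in r.content; a gzip stream starts with \x1f\x8b.
    if r.content[:2] == b'\x1f\x8b':
        html = gzip.GzipFile(fileobj=io.BytesIO(r.content)).read()
    else:
        html = r.content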

    The work-around is to tell the server not to bother with compression:

    import requests

    url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
    headers = {'Accept-Encoding': 'identity'}  # ask for an uncompressed response
    r = requests.get(url, headers=headers)
    

    and an uncompressed response is returned.
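
    A quick sanity check that the work-around took effect (the exact header set is the server's to choose, but no Content-Encoding should be reported and the body should be readable HTML):

    print(r.headers.get('Content-Encoding'))  # expect None
    print(r.text[:100])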

    Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

    >>> from pprint import pprint
    >>> pprint(dict(r.headers))
    {'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
     'connection': 'Keep-Alive',
     'content-encoding': 'gzip',
     'content-length': '3659',
     'content-type': 'text/html',
     'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
     'keep-alive': 'timeout=5, max=100',
     'server': 'Apache',
     'vary': 'Accept-Encoding'}
    

    and the content-encoding information survives, so in that case requests decodes the content for you, as expected.

  • 2020-12-03 10:46

    The HTTP headers for this URL have now been fixed.

    >>> import requests
    >>> print requests.__version__
    2.5.1
    >>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
    >>> r.text[:100]
    u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
    >>> r.headers
    {'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
    