IncompleteRead using httplib

执念已碎 2020-12-01 10:29

I have been having a persistent problem getting an RSS feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious whether there is a cleaner way to handle it.

3 Answers
  • 2020-12-01 11:06

    In my case I found that sending an HTTP/1.0 request fixed the problem. Just add this to the code:

    import httplib
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    

    Then I make the request:

    req = urllib2.Request(url, post, headers)
    filedescriptor = urllib2.urlopen(req)
    img = filedescriptor.read()
    

    Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):

    httplib.HTTPConnection._http_vsn = 11
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
    
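    On Python 3 the same trick applies: `httplib` became `http.client`, and the (private) version attributes still exist. A minimal sketch, assuming a placeholder `url` variable:

```python
# Python 3 sketch of the same HTTP/1.0 downgrade; httplib is now http.client.
import http.client
import urllib.request

# Force HTTP/1.0 so the server cannot use chunked transfer encoding
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# ... perform the request here, e.g.:
# data = urllib.request.urlopen(url).read()

# Switch back to HTTP/1.1 afterwards (for servers that support it)
http.client.HTTPConnection._http_vsn = 11
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.1'
```

    Note these are undocumented internals, so this is a workaround rather than a supported API.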
  • 2020-12-01 11:07

    I fixed the issue by using HTTPS instead of HTTP, and it works fine. No code change was required.

  • 2020-12-01 11:28

    At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.

    Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:

    >>> import urllib2
    >>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
    >>> f = urllib2.urlopen(url)
    >>> f.headers.headers
    ['Cache-Control: private\r\n',
     'Content-Type: text/xml; charset=utf-8\r\n',
     'Server: Microsoft-IIS/7.5\r\n',
     'X-AspNet-Version: 4.0.30319\r\n',
     'X-Powered-By: ASP.NET\r\n',
     'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
     'Via: 1.1 BC1-ACLD\r\n',
     'Transfer-Encoding: chunked\r\n',
     'Connection: close\r\n']
    >>> f.read()
    < Full traceback cut >
    IncompleteRead: IncompleteRead(1854 bytes read)
    

    So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:

    >>> f = urllib2.urlopen(url)
    >>> f.read(1854)
    '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
    

    Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:

    >>> try:
    ...     contents = f.read()
    ... except httplib.IncompleteRead as e:
    ...     contents = e.partial
    ...
    >>> print contents
    '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
    
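    That try/except can be wrapped in a small helper. Written here for Python 3, where `httplib` is `http.client`; the name `read_all` is my own:

```python
import http.client

def read_all(response):
    """Read a response body, keeping whatever arrived if the read is cut short."""
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        # e.partial holds the bytes received before the connection dropped
        return e.partial
```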

    This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:

    import httplib
    
    def patch_http_response_read(func):
        def inner(*args):
            try:
                return func(*args)
            except httplib.IncompleteRead as e:
                return e.partial
    
        return inner
    
    httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
    
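    For reference, a Python 3 rendering of the same monkey-patch (`http.client` in place of `httplib`, with keyword arguments passed through) might look like:

```python
import http.client

def patch_http_response_read(func):
    """Wrap HTTPResponse.read so a truncated body is returned instead of raising."""
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except http.client.IncompleteRead as e:
            # Return whatever was received before the connection dropped
            return e.partial
    return inner

http.client.HTTPResponse.read = patch_http_response_read(http.client.HTTPResponse.read)
```

    As in the Python 2 version, this silently swallows the error for every response in the process, so it is best kept to scripts that only talk to the one misbehaving server.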

    I applied the patch and then feedparser worked:

    >>> import feedparser
    >>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
    >>> feedparser.parse(url)
    {'bozo': 0,
     'encoding': 'utf-8',
     'entries': ...
     'status': 200,
     'version': 'rss20'}
    

    This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing something wrong or whether httplib is mishandling an edge case.
