How to know if urllib.urlretrieve succeeds?

2020-11-30 01:20

urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file.
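
A sketch of the kind of call in question (the nonexistent URL is the one the answers below use):

    import urllib

    # http://google.com/abc.jpg does not exist, yet no exception is raised;
    # 'abc.jpg' ends up containing the server's HTML error page instead
    filename, headers = urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')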

8 Answers
  • 2020-11-30 01:30

    I ended up writing my own retrieve implementation; with the help of pycurl it supports more protocols than urllib/urllib2. I hope it can help other people.

    import os
    import tempfile

    import pycurl

    def get_filename_parts_from_url(url):
        # last path component, with any fragment or query string stripped
        fullname = url.split('/')[-1].split('#')[0].split('?')[0]
        t = list(os.path.splitext(fullname))
        if t[1]:
            t[1] = t[1][1:]  # drop the leading dot from the extension
        return t

    def retrieve(url, filename=None):
        if not filename:
            garbage, suffix = get_filename_parts_from_url(url)
            f = tempfile.NamedTemporaryFile(suffix='.' + suffix, delete=False)
            filename = f.name
        else:
            f = open(filename, 'wb')
        c = pycurl.Curl()
        c.setopt(pycurl.URL, str(url))
        c.setopt(pycurl.WRITEFUNCTION, f.write)
        # fail on HTTP status >= 400 instead of silently saving the error page
        c.setopt(pycurl.FAILONERROR, True)
        try:
            c.perform()
        except pycurl.error:
            filename = None
        finally:
            c.close()
            f.close()
        return filename
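
    Usage is then straightforward; a quick sketch (retrieve returns None when the transfer fails):

    saved = retrieve('http://google.com/abc.jpg')
    if saved is None:
        print 'download failed'
    else:
        print 'saved to', saved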
    
  • 2020-11-30 01:32

    Consider using urllib2 if it is possible in your case. It is more advanced and easier to use than urllib.

    You can detect any HTTP errors easily:

    >>> import urllib2
    >>> resp = urllib2.urlopen("http://google.com/abc.jpg")
    Traceback (most recent call last):
    <<MANY LINES SKIPPED>>
    urllib2.HTTPError: HTTP Error 404: Not Found
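
    So a failed download can be caught with an ordinary try/except; a minimal sketch, reusing the URL from above:

    >>> try:
    ...     resp = urllib2.urlopen("http://google.com/abc.jpg")
    ... except urllib2.HTTPError, e:
    ...     print "download failed:", e.code
    ...
    download failed: 404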
    

    resp is actually an HTTPResponse object that you can do a lot of useful things with:

    >>> resp = urllib2.urlopen("http://google.com/")
    >>> resp.code
    200
    >>> resp.headers["content-type"]
    'text/html; charset=windows-1251'
    >>> resp.read()
    "<<ACTUAL HTML>>"
    
  • 2020-11-30 01:36

    You can create a new URLopener (inheriting from FancyURLopener) and raise exceptions or handle errors any way you want. Unfortunately, the FancyURLopener that urlretrieve uses by default ignores 404 and other errors; a minimal subclass is sketched below. See also this question:

    How to catch 404 error in urllib.urlretrieve
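
    A minimal sketch of that subclassing approach (the class name is illustrative; the URL is the nonexistent one from the question):

    import urllib

    class RaisingURLopener(urllib.FancyURLopener):
        # restore URLopener's default handler, which raises IOError on HTTP errors
        http_error_default = urllib.URLopener.http_error_default

    try:
        RaisingURLopener().retrieve("http://google.com/abc.jpg", "abc.jpg")
    except IOError, e:
        print "retrieve failed:", e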

  • 2020-11-30 01:42

    :) My first post on StackOverflow; I have been a lurker for years. :)

    Sadly, dir(urllib.urlretrieve) is deficient in useful information, so based on this thread so far I tried writing this:

    a,b = urllib.urlretrieve(imgURL, saveTo)
    print "A:", a
    print "B:", b
    

    which produced this:

    A: /home/myuser/targetfile.gif
    B: Accept-Ranges: bytes
    Access-Control-Allow-Origin: *
    Cache-Control: max-age=604800
    Content-Type: image/gif
    Date: Mon, 07 Mar 2016 23:37:34 GMT
    Etag: "4e1a5d9cc0857184df682518b9b0da33"
    Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
    Server: ECS (hnd/057A)
    Timing-Allow-Origin: *
    X-Cache: HIT
    Content-Length: 27027
    Connection: close
    

    I guess one can check:

    # the hyphenated header name must be looked up, not used as an attribute
    if int(b.getheader('Content-Length', 0)) > 0:
    

    My next step is to test a scenario where the retrieve fails...
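
    In the meantime, a sketch of one such check, with the same imgURL and saveTo as above: a failed retrieve typically saves the server's HTML error page, so the Content-Type header gives it away.

    a, b = urllib.urlretrieve(imgURL, saveTo)
    # asking for an image but getting HTML back almost certainly means an error page
    if b.gettype() == 'text/html':
        print "B: got an HTML error page instead of the image"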

  • 2020-11-30 01:47

    According to the documentation, the return value is undocumented.

    To get access to the message, it looks like you do something like:

    a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')
    

    b is the message instance

    Since learning Python, I have found it is always useful to use its introspection abilities. When I type

    dir(b) 
    

    I see lots of methods and functions to play with.

    Then I started doing things with b. For example:

    b.items()
    

    Lists lots of interesting things. I suspect that playing around with them will get you to the attribute you want to manipulate.

    Sorry this is such a beginner's answer, but I am trying to master Python's introspection abilities to improve my learning, and your question just popped up.

    Well, I tried something interesting related to this: I wondered whether I could automatically get the output from each of the things in dir(b) that did not need parameters, so I wrote:

    needparam = []
    for each in dir(b):
        x = 'b.' + each + '()'
        try:
            eval(x)    # try calling the attribute with no arguments
            print x
        except:        # anything that fails presumably needs parameters
            needparam.append(x)
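
    The same experiment can also be written without eval; a sketch using getattr instead:

    needparam = []
    for name in dir(b):
        attr = getattr(b, name)
        if not callable(attr):
            continue
        try:
            attr()    # call with no arguments
            print name
        except Exception:
            needparam.append(name)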
    
  • 2020-11-30 01:49
    import urllib

    class MyURLopener(urllib.FancyURLopener):
        # restore URLopener's default handler, which raises IOError on HTTP errors
        http_error_default = urllib.URLopener.http_error_default

    url = "http://page404.com"
    filename = "download.txt"

    def reporthook(blockcount, blocksize, totalsize):
        # progress callback: blocks read so far, block size, total file size
        pass

    try:
        (f, headers) = MyURLopener().retrieve(url, filename, reporthook)
    except Exception, e:
        print e
    