urllib.urlretrieve
returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:
I ended up writing my own retrieve
implementation, with the help of pycurl
. It supports more protocols than urllib/urllib2; I hope it can help other people.
import os
import tempfile

import pycurl

def get_filename_parts_from_url(url):
    # Strip the fragment and query string, then split name and extension.
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]  # drop the leading dot from the extension
    return t

def retrieve(url, filename=None):
    if not filename:
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix='.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    try:
        c.perform()
    except pycurl.error:
        filename = None  # signal failure to the caller
    finally:
        c.close()
        f.close()
    return filename
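As a quick sanity check of the filename helper in isolation (the example URL below is made up), the function splits off the query string and fragment before taking the extension:

```python
import os

# Copy of the helper from the answer above, so this snippet runs on its own.
def get_filename_parts_from_url(url):
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]  # drop the leading dot
    return t

print(get_filename_parts_from_url("http://example.com/path/abc.jpg?width=100#top"))
# ['abc', 'jpg']
```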
Consider using urllib2
if possible in your case. It is more advanced and easier to use than urllib
.
You can detect any HTTP errors easily:
>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found
resp
is actually an HTTPResponse
object that you can do a lot of useful things with:
>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
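That traceback suggests a simple pattern: wrap the call in try/except and treat HTTPError as "file not there". A minimal sketch, using the Python 3 module names (urllib.request / urllib.error; on Python 2 the same classes live in urllib2) and a hypothetical helper name:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_or_none(url):
    """Return the response body, or None if the server reports an error."""
    try:
        return urlopen(url).read()
    except HTTPError as e:
        print("HTTP error %d for %s" % (e.code, url))  # e.code is 404 etc.
        return None
    except URLError as e:
        print("Connection problem: %s" % e.reason)
        return None

# fetch_or_none("http://google.com/abc.jpg") would print the 404 and return None.
```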
You can create a new URLopener (inherit from FancyURLopener) and throw exceptions or handle errors any way you want. Unfortunately, FancyURLopener ignores 404 and other errors. See this question:
How to catch 404 error in urllib.urlretrieve
:) My first post on StackOverflow; I have been a lurker for years. :)
Sadly, dir(urllib.urlretrieve) is deficient in useful information. So, based on this thread so far, I tried writing this:
a,b = urllib.urlretrieve(imgURL, saveTo)
print "A:", a
print "B:", b
which produced this:
A: /home/myuser/targetfile.gif
B: Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Cache-Control: max-age=604800
Content-Type: image/gif
Date: Mon, 07 Mar 2016 23:37:34 GMT
Etag: "4e1a5d9cc0857184df682518b9b0da33"
Last-Modified: Sun, 06 Mar 2016 21:16:48 GMT
Server: ECS (hnd/057A)
Timing-Allow-Origin: *
X-Cache: HIT
Content-Length: 27027
Connection: close
I guess one can check (note that b.Content-Length is not valid Python; the returned headers object supports dictionary-style access):
    if int(b['Content-Length']) > 0:
My next step is to test a scenario where the retrieve fails...
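Building on that idea, here is a sketch of such a check. The function name is hypothetical, and a plain dict stands in for the headers object (the real one returned by urlretrieve supports the same key lookups); an image URL that comes back as text/html is usually an error page:

```python
def looks_like_real_download(headers):
    # headers: a dict-like object, e.g. the second value from urlretrieve.
    content_type = headers.get("Content-Type", "")
    content_length = int(headers.get("Content-Length", 0))
    # A non-empty body with a non-HTML type is probably the real file.
    return content_length > 0 and not content_type.startswith("text/html")

print(looks_like_real_download({"Content-Type": "image/gif", "Content-Length": "27027"}))  # True
print(looks_like_real_download({"Content-Type": "text/html", "Content-Length": "512"}))    # False
```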
According to the documentation it is undocumented;
to get access to the message it looks like you do something like:
a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')
b is the message instance
Since I have learned Python, I have found it is always useful to use its ability to be introspective. When I type
    dir(b)
I see lots of methods and functions to play with. Then I started doing things with b, for example:
    b.items()
This lists lots of interesting things; I suspect that playing around with them will let you get at the attribute you want to manipulate.
Sorry this is such a beginner's answer but I am trying to master how to use the introspection abilities to improve my learning and your questions just popped up.
Well, I tried something interesting related to this: I wondered whether I could automatically get the output from each of the things in the directory that did not need parameters, so I wrote:
needparam = []
for each in dir(b):
    x = 'b.' + each + '()'
    try:
        eval(x)
        print x
    except:
        needparam.append(x)
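The same idea can be written without eval(), using getattr() and callable(). This is a standalone sketch: the Demo class below is a stand-in, since the real headers object from urlretrieve is not available here:

```python
class Demo:
    # Stand-in for the headers object; only these two methods matter here.
    def keys(self):
        return ["Content-Type"]
    def items(self):
        return [("Content-Type", "image/gif")]

b = Demo()
no_param = []
need_param = []
for name in dir(b):
    if name.startswith("_"):
        continue  # skip dunder machinery
    attr = getattr(b, name)
    if not callable(attr):
        continue
    try:
        attr()  # try calling with no arguments
        no_param.append(name)
    except TypeError:
        need_param.append(name)

print(sorted(no_param))  # ['items', 'keys']
```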
import urllib

class MyURLopener(urllib.FancyURLopener):
    # Restore the strict default handler, so HTTP errors raise
    # instead of being silently swallowed by FancyURLopener.
    http_error_default = urllib.URLopener.http_error_default

url = "http://page404.com"
filename = "download.txt"

def reporthook(blockcount, blocksize, totalsize):
    pass

...

try:
    (f, headers) = MyURLopener().retrieve(url, filename, reporthook)
except Exception as e:
    print e