Question
So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few similar programs on here, but nothing quite like what I need. The one I found most similar is right here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code:
>>> import urllib
>>> image = urllib.URLopener()
>>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>)
I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest. Essentially just use a for loop, split the string at '00000000.jpg', and increment the '00000000' part up to the largest number, which I would have to somehow determine. Any recommendations on the best way to do this or how to download the file correctly?
Thanks!
EDIT 6/15/10
Here is the completed script; it saves the files to any directory you choose. For some odd reason, the files weren't downloading at first, and then they just did. Any suggestions on how to clean it up would be much appreciated. I'm currently working out how to find out how many comics exist on the site, so I can get just the latest one rather than having the program quit after a certain number of exceptions are raised.
import urllib
import os

comicCounter = len(os.listdir('/file')) + 1  # reads the number of files in the folder to start downloading at the next comic
errorCount = 0

def download_comic(url, comicName):
    """
    download a comic in the form of
    url = http://www.example.com
    comicName = '00000000.jpg'
    """
    image = urllib.URLopener()
    image.retrieve(url, comicName)  # download comicName at URL

while comicCounter <= 1000:  # not the most elegant solution
    os.chdir('/file')  # set where files download to
    try:
        if comicCounter < 10:  # needed to break into 10^n segments because comic names are a set of zeros followed by a number
            comicNumber = str('0000000' + str(comicCounter))  # string containing the eight digit comic number
            comicName = str(comicNumber + ".jpg")  # string containing the file name
            url = str("http://www.gunnerkrigg.com//comics/" + comicName)  # creates the URL for the comic
            comicCounter += 1  # increments the comic counter to go to the next comic, must be before the download in case the download raises an exception
            download_comic(url, comicName)  # uses the function defined above to download the comic
            print url
        elif 10 <= comicCounter < 100:  # elif, so a freshly incremented counter doesn't fall through into the next branch
            comicNumber = str('000000' + str(comicCounter))
            comicName = str(comicNumber + ".jpg")
            url = str("http://www.gunnerkrigg.com//comics/" + comicName)
            comicCounter += 1
            download_comic(url, comicName)
            print url
        elif 100 <= comicCounter < 1000:
            comicNumber = str('00000' + str(comicCounter))
            comicName = str(comicNumber + ".jpg")
            url = str("http://www.gunnerkrigg.com//comics/" + comicName)
            comicCounter += 1
            download_comic(url, comicName)
            print url
        else:  # quit the program if any number outside this range shows up
            break
    except IOError:  # urllib raises an IOError for a 404 error, when the comic doesn't exist
        errorCount += 1  # add one to the error count
        if errorCount > 3:  # if more than three errors occur during downloading, quit the program
            break
        else:
            print str("comic" + ' ' + str(comicCounter) + ' ' + "does not exist")  # otherwise say that the certain comic number doesn't exist
print "all comics are up to date"  # prints if all comics are downloaded
Answer 1:
Using urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://www.gunnerkrigg.com//comics/00000001.jpg", "00000001.jpg")
Answer 2:
import urllib
f = open('00000001.jpg','wb')
f.write(urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg').read())
f.close()
Answer 3:
Just for the record, using the requests library:
import requests
f = open('00000001.jpg','wb')
f.write(requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg').content)
f.close()
Though you should check for errors from requests.get().
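For instance, a minimal sketch of that check using raise_for_status(), which raises requests.HTTPError on a 4xx/5xx response (URL taken from the question):
import requests

response = requests.get('http://www.gunnerkrigg.com//comics/00000001.jpg')
response.raise_for_status()  # raises requests.HTTPError if the download failed
with open('00000001.jpg', 'wb') as f:
    f.write(response.content)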
Answer 4:
For Python 3 you will need to import urllib.request:
import urllib.request
urllib.request.urlretrieve(url, filename)
For more info, check out the urllib.request docs.
Answer 5:
Python 3 version of @DiGMi's answer:
from urllib import request
f = open('00000001.jpg', 'wb')
f.write(request.urlopen("http://www.gunnerkrigg.com/comics/00000001.jpg").read())
f.close()
Answer 6:
I found this answer and edited it to make it more reliable:
import os
import urllib

def download_photo(self, img_url, filename):
    try:
        image_on_web = urllib.urlopen(img_url)
        if image_on_web.headers.maintype == 'image':
            buf = image_on_web.read()
            path = os.getcwd() + DOWNLOADED_IMAGE_PATH  # DOWNLOADED_IMAGE_PATH is assumed defined elsewhere in the original
            file_path = "%s%s" % (path, filename)
            downloaded_image = open(file_path, "wb")
            downloaded_image.write(buf)
            downloaded_image.close()
            image_on_web.close()
        else:
            return False
    except:
        return False
    return True
This way you never save any other kind of resource by mistake, and exceptions raised while downloading are handled.
Answer 7:
If you know that the files are located in the same directory dir of the website site and have the following format: filename_01.jpg, ..., filename_10.jpg, then download all of them:
import requests

for x in range(1, 11):  # range(1, 11) covers 01 through 10
    str1 = 'filename_%2.2d.jpg' % (x)
    str2 = 'http://site/dir/filename_%2.2d.jpg' % (x)
    f = open(str1, 'wb')
    f.write(requests.get(str2).content)
    f.close()
Answer 8:
It's easiest to just use .read()
to read the partial or entire response, then write it into a file you've opened in a known good location.
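A minimal sketch of that approach, assuming Python 2 urllib as in the question and an arbitrary example file name:
import urllib

response = urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg')
data = response.read()  # reads the entire response body into memory
response.close()
with open('00000001.jpg', 'wb') as f:  # open in a known good location of your choosing
    f.write(data)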
Answer 9:
Maybe you need a 'User-Agent':
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36')]
response = opener.open('http://google.com')
htmlData = response.read()
f = open('file.txt','w')
f.write(htmlData)
f.close()
Answer 10:
Aside from suggesting you read the docs for retrieve() carefully (http://docs.python.org/library/urllib.html#urllib.URLopener.retrieve), I would suggest actually calling read() on the content of the response and then saving it into a file of your choosing, rather than leaving it in the temporary file that retrieve() creates.
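A hedged sketch of that suggestion, here streaming the response into a file of your choosing with shutil.copyfileobj instead of buffering it all with one read() (Python 2 urllib, as in the question):
import shutil
import urllib

response = urllib.urlopen('http://www.gunnerkrigg.com//comics/00000001.jpg')
with open('00000001.jpg', 'wb') as f:
    shutil.copyfileobj(response, f)  # copies the body to disk in chunks
response.close()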
Answer 11:
None of the code above preserves the original image name, which is sometimes required. This will help in saving the images to your local drive while preserving the original image name:
import urllib

URL = 'http://www.gunnerkrigg.com//comics/00000001.jpg'  # example URL from the question
IMAGE = URL.rsplit('/', 1)[1]  # everything after the last '/', i.e. the original file name
urllib.urlretrieve(URL, IMAGE)
Answer 12:
This worked for me using Python 3. It gets a list of URLs from a CSV file and starts downloading them into a folder. If the content or image does not exist, it catches that exception and continues working its magic.
import urllib.request
import csv
import os

errorCount = 0
file_list = "/Users/$USER/Desktop/YOUR-FILE-TO-DOWNLOAD-IMAGES/image_{0}.jpg"

# CSV file must be separated by commas
# urls.csv is read from your current working directory; make sure you cd into it or add the corresponding path
with open('urls.csv') as images:
    images = csv.reader(images)
    img_count = 1
    print("Please Wait.. it will take some time")
    for image in images:
        try:
            urllib.request.urlretrieve(image[0],
                                       file_list.format(img_count))
            img_count += 1
        except IOError:
            errorCount += 1
            # Stop in case you reach 100 errors downloading images
            if errorCount > 100:
                break
            else:
                print("File does not exist")
print("Done!")
Answer 13:
A simpler solution may be (Python 3):
import urllib.request
import os

os.chdir("D:\\comic")  # your path
i = 1
s = "00000000"
while i < 1000:
    try:
        urllib.request.urlretrieve("http://www.gunnerkrigg.com//comics/" + s[:8 - len(str(i))] + str(i) + ".jpg", str(i) + ".jpg")
    except:
        print("not possible" + str(i))
    i += 1
Answer 14:
What about this:
import os
import urllib.request
import urllib.error

def from_url(url, filename=None):
    '''Store the url content to filename'''
    if not filename:
        filename = os.path.basename(os.path.realpath(url))
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.URLError as e:
        if hasattr(e, 'reason'):
            print('Fail in reaching the server -> ', e.reason)
            return False
        elif hasattr(e, 'code'):
            print("The server couldn't fulfill the request -> ", e.code)
            return False
    else:
        with open(filename, 'wb') as fo:
            fo.write(response.read())
        print('Url saved as %s' % filename)
        return True

##

def main():
    test_url = 'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
    from_url(test_url)

if __name__ == '__main__':
    main()
Answer 15:
If you need proxy support you can do this:
import urllib
import urllib2

# needProxy, myUrl, fullJpegPathAndName and myHttpProxyAddress are assumed to be defined by the caller
if needProxy == False:
    returnCode, urlReturnResponse = urllib.urlretrieve(myUrl, fullJpegPathAndName)
else:
    proxy_support = urllib2.ProxyHandler({"https": myHttpProxyAddress})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)
    urlReader = urllib2.urlopen(myUrl).read()
    with open(fullJpegPathAndName, "wb") as f:  # "wb", since a JPEG is binary data
        f.write(urlReader)
Answer 16:
Another way to do this is via the fastai library. This worked like a charm for me. I was facing an SSL: CERTIFICATE_VERIFY_FAILED error using urlretrieve, so I tried this instead.
import fastai.core

url = 'https://www.linkdoesntexist.com/lennon.jpg'
fastai.core.download_url(url, 'image1.jpg', show_progress=False)
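If you would rather stay with the standard library, one well-known (and insecure) workaround for that SSL error is to install an unverified SSL context before calling urlretrieve; a minimal sketch, assuming Python 3:
import ssl
import urllib.request

# Warning: this disables certificate verification globally for HTTPS requests
# made via urllib; only use it when you trust the endpoint.
ssl._create_default_https_context = ssl._create_unverified_context
urllib.request.urlretrieve('https://www.linkdoesntexist.com/lennon.jpg', 'image1.jpg')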
Source: https://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python