Question
I have been browsing through several posts here but I just cannot get my head around batch-downloading images and text from a given URL with Python.
import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)

    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            print "found image"
            src = img["src"]
            if src:
                src = absolutize(src, pageurl)
                f = open(src, 'wb')
                f.write(urllib.urlopen(src).read())
                f.close()
        for h5 in div.findAll("h5"):
            print "found Headline"
            value = (h5.contents[0])
            print >> headlines.txt, value

def main():
    getAllImages("http://www.nytimes.com/")
Above is my updated code. When I run it, nothing happens: it never finds a div with the thumbnail class, so none of the print statements produce any output. I am probably missing something about how to reach the right divs containing the images and headlines?
Thanks a lot!
Answer 1:
The OS you are using doesn't know how to write to the file path you are passing it in src. Make sure that the name you use to save the file to disk is one the OS can actually use:
import os

src = "abc.com/alpha/beta/charlie.jpg"
with open(src, "wb") as f:
    pass  # IOError - cannot open file abc.com/alpha/beta/charlie.jpg

src = "alpha/beta/charlie.jpg"
os.makedirs(os.path.dirname(src))  # create alpha/beta before opening the file
with open(src, "wb") as f:
    pass  # Golden - write file here
and everything will start working.
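As a rough sketch of one way to turn an image URL into a path the OS can open: the helper names local_path_for and open_for_download and the "downloads" root directory are made up for illustration, and the code is written for Python 3 (urllib.parse), not the Python 2 modules used in the question. It keeps the host and path segments as directories and creates them before opening the file. Hosts that include a port (a ":" in the name) may still need extra cleanup on Windows.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, root_dir="downloads"):
    """Map an image URL to a file path under root_dir (hypothetical helper)."""
    parsed = urlparse(url)
    # Drop the scheme; keep host and path, skipping empty, "." and ".." segments.
    segments = (parsed.netloc + parsed.path).split("/")
    parts = [p for p in segments if p not in ("", ".", "..")]
    return os.path.join(root_dir, *parts)


def open_for_download(url, root_dir="downloads"):
    """Create the parent directories, then open the target file for binary writing."""
    path = local_path_for(url, root_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return open(path, "wb")
```

In the question's loop, the absolute URL would still be passed to urlopen, while the file would be opened with something like open_for_download(src) instead of open(src, 'wb').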
A couple of additional thoughts:
- Make sure to normalize the save file path (e.g. os.path.join(some_root_dir, relative_file_path)), otherwise you'll be writing images all over your hard drive depending on their src.
- Unless you are running tests of some kind, it's good to advertise that you are a bot in your user_agent string and honor robots.txt files (or alternately, provide some kind of contact information so people can ask you to stop if they need to).
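For the bot-etiquette point, here is a minimal Python 3 sketch using the standard library's urllib.robotparser. The bot name, contact address, and robots.txt content are invented for illustration, and the rules are parsed from a string so that nothing touches the network. A real script would fetch the site's robots.txt first.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and contact address, for illustration only.
USER_AGENT = "example-image-bot/0.1 (+mailto:you@example.com)"


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def make_request(url):
    """Build a request that identifies the bot (no network access happens here)."""
    return Request(url, headers={"User-Agent": USER_AGENT})


# Made-up robots.txt content; a real run would download it from the site.
sample_robots_txt = """User-agent: *
Disallow: /private/
"""
rules = build_robot_rules(sample_robots_txt)
allowed = rules.can_fetch(USER_AGENT, "http://example.com/images/a.jpg")
blocked = rules.can_fetch(USER_AGENT, "http://example.com/private/a.jpg")
print(allowed, blocked)
```

The idea would be to call can_fetch before each download and skip any URL the site disallows.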
Source: https://stackoverflow.com/questions/7918239/batch-downloading-text-and-images-from-url-with-python-urllib-beautifulsoup