I'm learning Python by following Automate the Boring Stuff. This program is supposed to go to http://xkcd.com/ and download all the images for offline viewing.
Change your comicUrl to this:

comicUrl = comicElem[0].get('src').strip("http://")
comicUrl = "http://" + comicUrl
if 'xkcd' not in comicUrl:
    comicUrl = comicUrl[:7] + 'xkcd.com/' + comicUrl[7:]
print("comic url", comicUrl)
Actually, this is not a big deal. You can see that comicUrl looks something like this: //imgs.xkcd.com/comics/acceptable_risk.png
The only thing you need to add is http: (remember, it is http: and not http:// as some folks said earlier, because the URL already contains the double slashes).
So change the code to:
res = requests.get('http:' + comicElem[0].get('src'))
or
comicUrl = 'http:' + comicElem[0].get('src')
res = requests.get(comicUrl)
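If you'd rather not hard-code the http: prefix, the standard library's urllib.parse.urljoin can resolve a scheme-relative // URL against the page URL instead. A minimal sketch (the src value here is just the example URL from above):

```python
from urllib.parse import urljoin

# A scheme-relative src, like what xkcd's img tag carries
src = '//imgs.xkcd.com/comics/acceptable_risk.png'

# urljoin copies the scheme from the base page URL onto the // URL
comicUrl = urljoin('http://xkcd.com/', src)
print(comicUrl)  # http://imgs.xkcd.com/comics/acceptable_risk.png
```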
Happy coding
I'd just like to chime in here that I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still getting problems. Right after the program downloaded the first image, it would stop and return this error:

ValueError: Unsupported or invalid CSS selector: "a[rel"

This was referring to one of the last lines in the program, where it uses the 'Prev button' to go to the next image to download.

Anyway, after going through the bs4 docs I made a slight change as follows, and it seems to work just fine now:

prevLink = soup.select('a[rel^="prev"]')[0]

Someone else might run into the same problem, so I thought I'd add this comment.
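For anyone curious what the ^= does there: it is the CSS "attribute value starts with" selector, which sidesteps the exact-match form that some older bs4 setups choked on. A small sketch against made-up HTML (the anchor markup here is only for illustration, not xkcd's actual page):

```python
import bs4

html = '<a rel="prev" href="/1/">Prev</a><a rel="next" href="/3/">Next</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# a[rel^="prev"] matches anchors whose rel attribute starts with "prev"
prevLink = soup.select('a[rel^="prev"]')[0]
print(prevLink.get('href'))  # /1/
```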
No schema means you haven't supplied the http:// or https:// part of the URL; supply one of these and it will do the trick.
Edit: Look at this URL string: '//imgs.xkcd.com/comics/the_martian.png'. Notice it has no scheme in front of the double slashes.
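You can see the exception without any network access at all, since requests raises it while preparing the request, before anything is sent. A quick sketch (assuming requests is installed):

```python
import requests

try:
    # No http:/https: scheme, so requests refuses to even send this
    requests.get('//imgs.xkcd.com/comics/the_martian.png')
except requests.exceptions.MissingSchema as e:
    print('MissingSchema:', e)
```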
Explanation:
A few XKCD pages have special content that isn't a simple image file. That's fine; you can just skip those. If your selector doesn't find any elements, soup.select('#comic img') will return an empty list.
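That empty-list case is easy to reproduce locally; here is a sketch with a made-up #comic div that contains no img tag (this HTML is illustrative only, not xkcd's real markup):

```python
import bs4

soup = bs4.BeautifulSoup('<div id="comic"><span>interactive comic</span></div>', 'html.parser')

# No <img> inside #comic here, so select returns an empty list
comicElem = soup.select('#comic img')
print(comicElem)  # []
```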
Working Code:
import requests, os, bs4, shutil

url = 'http://xkcd.com'

# make a fresh xkcd folder
if os.path.isdir('xkcd'):
    shutil.rmtree('xkcd')
os.makedirs('xkcd')

# scraping information
while not url.endswith('#'):
    print('Downloading page %s.....' % (url))
    res = requests.get(url)  # getting the page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    comicElem = soup.select('#comic img')  # getting the img tag under the comic division
    if comicElem == []:  # if not found, print an error
        print('could not find comic image')
    else:
        try:
            # getting the comic url and then downloading its image
            comicUrl = 'http:' + comicElem[0].get('src')
            print('Downloading image %s.....' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()
        except requests.exceptions.MissingSchema:
            # skip if not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        # write the downloaded image to the hard disk
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # get the previous link and update url
    prev = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev.get('href')

print('Done...')