I'm learning Python by following Automate the Boring Stuff. This program is supposed to go to http://xkcd.com/ and download all the images for offline viewing.
Change your comicUrl to this:

comicUrl = comicElem[0].get('src').strip("http://")
comicUrl = "http://" + comicUrl
if 'xkcd' not in comicUrl:
    comicUrl = comicUrl[:7] + 'xkcd.com/' + comicUrl[7:]
print("comic url", comicUrl)
Actually, this is not a big deal. You can see that comicUrl looks something like this: //imgs.xkcd.com/comics/acceptable_risk.png
The only thing you need to add is http: (remember, it is http: and not http:// as some folks said earlier, because the URL already contains the double slashes).
So change the code to:
res = requests.get('http:' + comicElem[0].get('src'))
or
comicUrl = 'http:' + comicElem[0].get('src')
res = requests.get(comicUrl)
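If you'd rather not hard-code the http: prefix, the standard library's urllib.parse.urljoin can resolve a scheme-relative // URL against the page URL instead. A minimal sketch (the src value here is just the example URL from above):

```python
from urllib.parse import urljoin

# A scheme-relative src, like what xkcd's img tag carries
src = '//imgs.xkcd.com/comics/acceptable_risk.png'

# urljoin copies the scheme from the base page URL onto the // URL
comicUrl = urljoin('http://xkcd.com/', src)
print(comicUrl)  # http://imgs.xkcd.com/comics/acceptable_risk.png
```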
Happy coding
I'd just like to chime in here that I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still getting problems. Right after the program downloaded the first image, it would stop and return this error:

ValueError: Unsupported or invalid CSS selector: "a[rel"

This was referring to one of the last lines in the program, where it uses the 'Prev button' to go to the next image to download.

Anyway, after going through the bs4 docs I made a slight change as follows, and it seems to work just fine now:

prevLink = soup.select('a[rel^="prev"]')[0]

Someone else might run into the same problem, so I thought I'd add this comment.
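For anyone curious what the ^= does there: it is the CSS "attribute value starts with" selector, which sidesteps the exact-match form that some older bs4 setups choked on. A small sketch against made-up HTML (the anchor markup here is only for illustration, not xkcd's actual page):

```python
import bs4

html = '<a rel="prev" href="/1/">Prev</a><a rel="next" href="/3/">Next</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# a[rel^="prev"] matches anchors whose rel attribute starts with "prev"
prevLink = soup.select('a[rel^="prev"]')[0]
print(prevLink.get('href'))  # /1/
```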
No schema means you haven't supplied the http:// or https:// part of the URL; supply one of these and it will do the trick.
Edit: Look at this URL string: '//imgs.xkcd.com/comics/the_martian.png'. Notice it has no scheme in front of the double slashes.
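You can see the exception without any network access at all, since requests raises it while preparing the request, before anything is sent. A quick sketch (assuming requests is installed):

```python
import requests

try:
    # No http:/https: scheme, so requests refuses to even send this
    requests.get('//imgs.xkcd.com/comics/the_martian.png')
except requests.exceptions.MissingSchema as e:
    print('MissingSchema:', e)
```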
Explanation:
A few XKCD pages have special content that isn't a simple image file. That's fine; you can just skip those. If your selector doesn't find any elements, soup.select('#comic img') will return an empty list.
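That empty-list case is easy to reproduce locally; here is a sketch with a made-up #comic div that contains no img tag (this HTML is illustrative only, not xkcd's real markup):

```python
import bs4

soup = bs4.BeautifulSoup('<div id="comic"><span>interactive comic</span></div>', 'html.parser')

# No <img> inside #comic here, so select returns an empty list
comicElem = soup.select('#comic img')
print(comicElem)  # []
```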
Working Code:
import requests, os, bs4, shutil

url = 'http://xkcd.com'

# make a fresh xkcd folder
if os.path.isdir('xkcd'):
    shutil.rmtree('xkcd')
os.makedirs('xkcd')

# scraping information
while not url.endswith('#'):
    print('Downloading page %s.....' % (url))
    res = requests.get(url)  # getting the page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    comicElem = soup.select('#comic img')  # getting the img tag under the comic division
    if comicElem == []:  # if not found, print an error
        print('could not find comic image')
    else:
        try:
            # getting the comic url and then downloading its image
            comicUrl = 'http:' + comicElem[0].get('src')
            print('Downloading image %s.....' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()
        except requests.exceptions.MissingSchema:
            # skip if not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        # write the downloaded image to the hard disk
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # get the previous link and update url
    prev = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev.get('href')

print('Done...')