No schema supplied and other errors with using requests.get()

情深已故 2020-12-09 10:05

I'm learning Python by following Automate the Boring Stuff. This program is supposed to go to http://xkcd.com/ and download all the images for offline viewing.

I'm getting a requests.exceptions.MissingSchema error as soon as it tries to download the first image:

requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/the_martian.png': No schema supplied.

5 Answers
  • 2020-12-09 10:28

Change your comicUrl like this:

# str.strip("http://") strips characters, not a prefix, so remove the
# leading slashes of the protocol-relative src explicitly instead
comicUrl = comicElem[0].get('src').lstrip('/')
comicUrl = 'http://' + comicUrl
if 'xkcd' not in comicUrl:
    comicUrl = comicUrl[:7] + 'xkcd.com/' + comicUrl[7:]

print('comic url', comicUrl)
    
  • 2020-12-09 10:33

Actually this is not a big deal. You can see the comicUrl is something like //imgs.xkcd.com/comics/acceptable_risk.png

The only thing you need to add is http: (remember, it is http: and not http://, as some folks said earlier, because the URL already contains the double slashes). So please change the code to:

    res = requests.get('http:' + comicElem[0].get('src'))
    

    or

    comicUrl = 'http:' + comicElem[0].get('src')
    
    res = requests.get(comicUrl)
    

    Happy coding
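As a more general alternative (an editor's sketch, not part of the original answer), the standard library's urllib.parse.urljoin resolves protocol-relative src values against the page URL and picks up its scheme automatically:

from urllib.parse import urljoin

page = 'http://xkcd.com/'
src = '//imgs.xkcd.com/comics/acceptable_risk.png'   # value taken from the img tag
comicUrl = urljoin(page, src)                        # scheme is copied from the page URL
print(comicUrl)   # http://imgs.xkcd.com/comics/acceptable_risk.png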

  • 2020-12-09 10:34

I'd just like to chime in here: I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still getting problems. Right after the program downloaded the first image it would stop and return this error:

    ValueError: Unsupported or invalid CSS selector: "a[rel"
    

This refers to one of the last lines in the program, where it uses the 'Prev' button to go to the next image to download.

Anyway, after going through the bs4 docs, I made a slight change, and it seems to work just fine now:

    prevLink = soup.select('a[rel^="prev"]')[0]
    

Someone else might run into the same problem, so I thought I'd add this comment.
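To see why the ^= prefix match is the safer selector here, a small self-contained check with a recent bs4 (the markup below is a made-up stand-in for xkcd's Prev link, not taken from this thread):

import bs4

html = '<a rel="prev nofollow" href="/2614/">&lt; Prev</a>'   # hypothetical snippet
soup = bs4.BeautifulSoup(html, 'html.parser')

# ^= matches any rel value that starts with "prev", so extra tokens are fine
prevLink = soup.select('a[rel^="prev"]')[0]
print(prevLink.get('href'))   # /2614/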

  • 2020-12-09 10:40

'No schema' means you haven't supplied the http:// or https:// part; supply it and that will do the trick.

Edit: Look at the URL string in the error message:

URL '//imgs.xkcd.com/comics/the_martian.png'

It starts with // and has no scheme in front.
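A quick way to check for this condition before calling requests (a sketch using the standard library, assuming the usual protocol-relative src value):

from urllib.parse import urlparse

src = '//imgs.xkcd.com/comics/the_martian.png'
if not urlparse(src).scheme:      # empty scheme is exactly the 'no schema' case
    src = 'http:' + src           # the // is already there, so add only 'http:'
print(src)   # http://imgs.xkcd.com/comics/the_martian.png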

  • 2020-12-09 10:42

    Explanation:

    A few XKCD pages have special content that isn’t a simple image file. That’s fine; you can just skip those. If your selector doesn’t find any elements, then soup.select('#comic img') will return a blank list.

    Working Code:

import requests, os, bs4, shutil

url = 'http://xkcd.com'

# make a fresh folder: remove any leftover one, then create it
if os.path.isdir('xkcd'):
    shutil.rmtree('xkcd')
os.makedirs('xkcd')

# scraping information
while not url.endswith('#'):
    print('Downloading page %s.....' % (url))
    res = requests.get(url)                        # getting the page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    comicElem = soup.select('#comic img')          # img tag under the comic division
    if comicElem == []:                            # if not found, report it and move on
        print('could not find comic image')
    else:
        try:
            comicUrl = 'http:' + comicElem[0].get('src')    # getting comic url, then downloading its image
            print('Downloading image %s.....' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()
        except requests.exceptions.MissingSchema:
            # skip if it is not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        # write the downloaded image to disk in chunks
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # get the previous link and update url; this also runs when no image was
    # found, so the loop cannot get stuck on a special page
    prev = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prev.get('href')

print('Done...')
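Note: res.iter_content(10000) streams the response in 10,000-byte chunks, so even a large image never has to be held in memory all at once, and the file is opened in binary ('wb') mode so the image bytes are written unchanged.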
    