BeautifulSoup select all href in some element with specific class

Submitted by 。_饼干妹妹 on 2019-12-24 09:29:39

Question


I'm trying to scrape images from this website. I tried with Scrapy (using Docker) and with Scrapy/Selenium. Scrapy doesn't seem to work on Windows 10 Home, so I'm now trying Selenium/BeautifulSoup instead. I'm using Python 3.6 with Spyder in an Anaconda env.

This is what the href elements I need look like:

<a class="emblem" href="detail/emblem/av1615001">

I have two major problems:
- How should I select the href with BeautifulSoup? In my code below you can see what I tried (but it didn't work).
- As you can observe, the href is only a partial path to the URL... how should I deal with this issue?

Here is my code so far:

from bs4 import BeautifulSoup
from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import urllib 
import requests
from os.path  import basename


def start_requests(self):
        self.driver = webdriver.Firefox("C:/Anaconda3/envs/scrapy/selenium/webdriver")
        #programPause = input("Press the <ENTER> key to continue...")
        self.driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        html = self.driver.page_source

        #html = requests.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
        soup = BeautifulSoup(html, "html.parser")        
        emblemshref = soup.select("a", {"class" : "emblem", "href" : True})

        for href in emblemshref:
            link = href["href"]
            with open(basename(link), "wb") as f:
                f.write(requests.get(link).content)

        #click on "next>>"         
        while True:
            try:
                next_page = self.driver.find_element_by_xpath("//a[@id='next']")
                sleep(3)
                self.logger.info('Sleeping for 3 seconds')
                next_page.click()

                #here again the same emblemshref loop 

            except NoSuchElementException:
                #execute next on the last page
                self.logger.info('No more pages to load') 
                self.driver.quit()
                break 

Answer 1:


Try this. It will give you all the URLs while traversing all the pages of that site. I've used explicit waits to make it faster and more robust.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "http://emblematica.grainger.illinois.edu/"
wait = WebDriverWait(driver, 10)
driver.get("http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18")
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".emblem")))

while True:
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select('.emblem'):
        # the hrefs are relative, so prepend the site root
        links = url + item['href']
        print(links)

    try:
        # Click "next" and wait for the old element to go stale,
        # which signals that the next page has loaded.
        link = driver.find_element_by_id("next")
        link.click()
        wait.until(EC.staleness_of(link))
    except Exception:
        break
driver.quit()

Partial output:

http://emblematica.grainger.illinois.edu/detail/emblem/av1615001
http://emblematica.grainger.illinois.edu/detail/emblem/av1615002
http://emblematica.grainger.illinois.edu/detail/emblem/av1615003
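
The answer above builds absolute URLs by simple string concatenation, which works here because the site's hrefs are relative to the root. A more robust way to resolve the partial hrefs from the question (a minimal sketch, not part of the original answer) is urllib.parse.urljoin from the standard library:

from urllib.parse import urljoin

base = "http://emblematica.grainger.illinois.edu/"
# urljoin resolves a relative href against the base URL, handling
# leading slashes and already-absolute hrefs correctly.
print(urljoin(base, "detail/emblem/av1615001"))
# http://emblematica.grainger.illinois.edu/detail/emblem/av1615001

Once the URLs are absolute, they can be passed to requests.get and saved, as in the loop from the question.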



Answer 2:


You can get the href by class name like this:

for link in soup.find_all('a', {'class': 'emblem'}):
    try:
        print(link['href'])
    except KeyError:
        pass
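
For reference, the soup.select call from the question didn't work because select expects a single CSS selector string, not an attribute dict. The same filtering done with a CSS selector (standard BeautifulSoup syntax, shown here as a small sketch) looks like this:

# 'a.emblem[href]' matches <a> tags with class "emblem" that also
# have an href attribute, so no KeyError handling is needed.
for link in soup.select('a.emblem[href]'):
    print(link['href'])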



Answer 3:


Not sure if the above answers did the job. Here is one that works for me.

url = "SOME-URL-YOU-WANT-TO-SCRAPE"
response = requests.get(url=url)
urls = BeautifulSoup(response.content, 'lxml').find_all('a', attrs={"class": ["YOUR-CLASS-NAME"]}, href=True)
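
Filled in with the URL and the emblem class from the question (just as an illustration; note that if the links are injected by JavaScript, requests will only see the server-rendered HTML, which is presumably why the question reaches for Selenium), it might look like this:

import requests
from bs4 import BeautifulSoup

url = "http://emblematica.grainger.illinois.edu/browse/emblems?Filter.Collection=Utrecht&Skip=0&Take=18"
response = requests.get(url=url)
# href=True skips any <a class="emblem"> that lacks an href attribute
links = BeautifulSoup(response.content, 'lxml').find_all(
    'a', attrs={"class": ["emblem"]}, href=True)
for a in links:
    print(a['href'])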


Source: https://stackoverflow.com/questions/47653309/beautifulsoup-select-all-href-in-some-element-with-specific-class
