Web scraping with a double loop with Selenium and By.CSS_SELECTOR

Anonymous (unverified), submitted 2019-12-03 09:06:55

Question:

I am trying to get the PDF files from this website. I want a double loop: the outer loop steps through the years (Season), and the inner loop collects the main PDF for each event in that year.

The line below, which is supposed to loop over all the years (Season), is the one I cannot get to work:

    for year in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#season a aria-valuetext"))):
        year.click()

This is the full code:

    os.chdir("C:..")
    driver = webdriver.Chrome("chromedriver.exe")
    wait = WebDriverWait(driver, 10)
    driver.get("http://www.motogp.com/en/Results+Statistics/")
    links = []

    for year in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#season a aria-valuetext"))):
        year.click()
        for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
            item.click()
            elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
            print(elem.get_attribute("href"))
            links.append(elem.get_attribute("href"))
            wait.until(EC.staleness_of(elem))

    driver.quit()

This is a previous post where I got help with the code above:

Scraping pdfs from this web

Answer 1:

The solution below should work for you. First, we iterate over the number of years in the season slider. Then, for each year, we work through the event list using your code. I added a sleep call because I kept getting a timeout.

CODE

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.keys import Keys
    import time

    driver = webdriver.Chrome("chromedriver.exe")
    wait = WebDriverWait(driver, 10)
    driver.get("http://www.motogp.com/en/Results+Statistics/")

    # Handle of the season slider; arrow keys sent to it change the year.
    slider = driver.find_element_by_xpath('//*[@id="handle_season"]')

    # 68 is the number of years on the slider.
    for year in range(68):
        # Wait until the event dropdown is present for the current year.
        wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="event"]')))

        for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
            item.click()
            elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
            print(elem.get_attribute("href"))
            # Wait for the old link to go stale before selecting the next event.
            wait.until(EC.staleness_of(elem))

        # Move the slider one step to the left, then pause to avoid a timeout.
        slider.send_keys(Keys.ARROW_LEFT)
        time.sleep(1)

    driver.quit()
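The loop above only prints each href, while the question wants the links kept in a list. Here is a minimal sketch of a cleanup step for that list. The helper name collect_pdf_links is hypothetical, and it assumes a PDF link is one whose URL path ends in ".pdf" (I have not checked that against the site's actual links):

```python
from urllib.parse import urlparse


def collect_pdf_links(hrefs):
    """Return the unique PDF URLs from hrefs, in first-seen order.

    Hypothetical helper, not part of Selenium.
    """
    seen = set()
    links = []
    for href in hrefs:
        # get_attribute("href") returns None when the attribute is missing.
        if not href:
            continue
        # Assumption: a PDF link has a URL path ending in ".pdf";
        # any query string after the path is ignored.
        if not urlparse(href).path.lower().endswith(".pdf"):
            continue
        # Skip links that were already collected.
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links
```

To use it, append elem.get_attribute("href") to a list inside the inner loop instead of printing it, then pass that list to collect_pdf_links after driver.quit().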

Answer 2:

If you're working behind a firewall, your expected conditions (ECs) often won't work. Try a time.sleep(10) call in place of the EC and see whether that gets you past it. Also, check driver.page_source before you run the EC: if you're behind a firewall, the HTML source will show it.
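One way to do the page_source check is to see whether the element ids the script waits on are in the HTML at all. This is only a sketch using the standard library's html.parser. The helper name missing_ids is hypothetical, and the ids "season" and "event" are taken from the selectors used in the question:

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects the value of every id attribute in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_ids(page_source, expected_ids):
    """Return the expected ids that do not appear in page_source.

    Hypothetical helper; an empty list means every id was found.
    """
    collector = _IdCollector()
    collector.feed(page_source)
    return [i for i in expected_ids if i not in collector.ids]
```

Calling missing_ids(driver.page_source, ["season", "event"]) right after driver.get(...) and getting a non-empty list back suggests the browser received something other than the results page (for example a block page), so the ECs would time out no matter how long they wait. Elements that the page adds later with JavaScript can also be missing at that point, so treat the result as a hint, not as proof.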


