Question
I'm trying to scrape data from the ScienceDirect website. I want to automate the scraping process by accessing the journal issues one after the other, by building a list of XPaths and looping over them. When I run the loop, I'm unable to access the remaining elements after accessing the first journal issue. This process worked for me on another website, but not on this one.
I would also like to know whether there is a better way to access these elements than this approach.
# Importing libraries
import os
import json
import time
from time import sleep

import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# initializing the Chrome webdriver
driver = webdriver.Chrome(executable_path=r"C:/selenium/chromedriver.exe")
# website to be accessed
driver.get("https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues")

# generating the list of xpaths to be accessed one after the other
issues = []
for i in range(0, 20):
    docs = str(i)
    for j in range(1, 7):
        sets = str(j)
        con = '//*[@id="0-accordion-panel-' + docs + '"]/section/div[' + sets + ']/a'
        issues.append(con)

# looping to access one issue after the other
for i in issues:
    try:
        hat = driver.find_element_by_xpath(i)
        hat.click()
        sleep(4)
        driver.back()
    except:
        print("no more issues", i)
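As an aside, the XPath list above can be built more compactly with a single f-string comprehension; a minimal sketch producing the same selectors (no browser needed):

```python
# Build the same 20 x 6 grid of XPaths with an f-string comprehension
issues = [
    f'//*[@id="0-accordion-panel-{i}"]/section/div[{j}]/a'
    for i in range(0, 20)
    for j in range(1, 7)
]
print(len(issues))   # 120
print(issues[0])     # //*[@id="0-accordion-panel-0"]/section/div[1]/a
```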
Answer 1:
To scrape data from the ScienceDirect page https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues you can perform the following steps:
First, open all the accordions.
Then open each issue in an adjacent tab using Ctrl + click().
Next, switch to the newly opened tab and scrape the required contents.
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.sciencedirect.com/journal/journal-of-corporate-finance/issues')

# open every accordion panel
accordions = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.accordion-panel.js-accordion-panel>button.accordion-panel-title>span")))
for accordion in accordions:
    ActionChains(driver).move_to_element(accordion).click(accordion).perform()

# collect all issue links, then open each in a new tab via Ctrl + click
issues = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.anchor.js-issue-item-link.text-m span.anchor-text")))
windows_before = driver.current_window_handle
for issue in issues:
    ActionChains(driver).key_down(Keys.CONTROL).click(issue).key_up(Keys.CONTROL).perform()
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    # switch to the new tab, scrape, then close it and return
    driver.switch_to.window(new_window)
    WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a#journal-title>span")))
    print(WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, "//h2"))).get_attribute("innerHTML"))
    driver.close()
    driver.switch_to.window(windows_before)
driver.quit()
Console Output:
Institutions, Governance and Finance in a Globally Connected Environment Volume 58 Corporate Governance in Multinational Enterprises . . .
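The tab bookkeeping in the code above reduces to finding the one window handle that was not present before the Ctrl + click; a minimal illustration, with hypothetical handle strings standing in for Selenium's opaque handles:

```python
# Hypothetical handle strings; real Selenium handles are opaque ids
windows_before = "CDwindow-AAAA"                      # original tab
windows_after = ["CDwindow-AAAA", "CDwindow-BBBB"]    # handles after Ctrl + click

# the new tab is whichever handle was not there before
new_window = [x for x in windows_after if x != windows_before][0]
print(new_window)  # CDwindow-BBBB
```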
References
You can find a couple of relevant detailed discussions in:
- How to open a link embeded in a webelement with in the main tab, in a new tab of the same window using Control + Click of Selenium Webdriver
- How to open multiple hrefs within a webtable to scrape through selenium
- WebScraping JavaScript-Rendered Content using Selenium in Python
- StaleElementReferenceException even after adding the wait while collecting the data from the wikipedia using web-scraping
- How to open each product within a website in a new tab for scraping using Selenium through Python
Source: https://stackoverflow.com/questions/59706039/unable-to-access-the-remaining-elements-by-xpaths-in-a-loop-after-accessing-the