Here is the HTML that I'm trying to scrape:
I am trying to get the first instance of 'td' under each 'tr' using Selenium (beautifulsoup won't work for this
I took your code and simplified the structure and ran the test with minimal lines of code as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Drop the enable-automation switch and the automation extension
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
print(driver.page_source)
# Wait up to 20 seconds for the table cells, first via CSS selector, then via XPath
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cr_dataTable tbody tr>td[class]")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='cr_dataTable']//tbody//tr/td[@class]")))])
As you observed, I hit the same roadblock: my tests didn't yield any results.
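One way to check whether the table ever reached the browser is to run the printed page source through a small parser and see if any first cells come back. The following is only a minimal sketch using the standard library's html.parser. The names FirstCellParser and first_cells are hypothetical, and it assumes the target table carries the class cr_dataTable (taken from the selectors above) and contains no nested tables.

```python
from html.parser import HTMLParser


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> in every <tr> of one target table."""

    def __init__(self, table_class="cr_dataTable"):
        super().__init__()
        self.table_class = table_class  # assumed class of the target table
        self.in_table = False           # currently inside the target table
        self.seen_td = False            # a <td> was already seen in this row
        self.capturing = False          # currently inside the first <td>
        self.cells = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and self.table_class in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.seen_td = False        # new row: look for its first cell
        elif self.in_table and tag == "td" and not self.seen_td:
            self.seen_td = True
            self.capturing = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        if self.capturing:
            self.cells[-1] += data


def first_cells(page_source):
    """Return the stripped text of the first cell of each row."""
    parser = FirstCellParser()
    parser.feed(page_source)
    return [cell.strip() for cell in parser.cells]
```

Calling first_cells(driver.page_source) and getting an empty list would suggest that the table was never served to the automated browser, rather than that the selectors were wrong.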
While inspecting the page source of the webpage, I observed that there is an EventListener within a script which validates certain page metrics. Some of them are:
window.utag_data
window.utag_data.page_performance
window.PerformanceTiming
window.PerformanceObserver
newrelic
first-contentful-paint
Page Source:
MET | MetLife Inc. Annual Income Statement - WSJ
This is a clear indication that the website is protected by rigorous bot-management techniques, and that navigation by a Selenium-driven, WebDriver-initiated browsing context gets detected and subsequently blocked.
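If you want to repeat this inspection on other pages, the same metric names can be searched for in the page source programmatically. This is just a rough sketch: find_markers and MARKERS are hypothetical names, and the marker strings are the ones observed in this page's source. Finding them is only a hint, since performance APIs and analytics hooks also appear on pages that do not block automation.

```python
# Marker strings observed in the page source discussed above
MARKERS = (
    "window.utag_data",
    "window.utag_data.page_performance",
    "window.PerformanceTiming",
    "window.PerformanceObserver",
    "newrelic",
    "first-contentful-paint",
)


def find_markers(page_source, markers=MARKERS):
    """Return the markers that occur in page_source, in the order given."""
    return [marker for marker in markers if marker in page_source]
```

With the driver from the code above, find_markers(driver.page_source) would list which of these names are present in the served page.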
You can find relevant discussions in: