StaleElementReferenceException even after adding a wait while collecting data from Wikipedia using web scraping

Submitted by 心已入冬 on 2021-02-02 09:56:26

Question


I am a newbie to web scraping. Pardon my silly mistakes if there are any.

I have been working on a project for which I need a list of movies as my data. I am trying to collect the data from Wikipedia using web scraping.

Here is my code:

def MoviesList(years, driver):
    for year in years:
        driver.implicitly_wait(150)
        year.click()
        table = driver.find_element_by_xpath('/html/body/div[3]/div[3]/div[5]/div[1]/table[2]/tbody')
        movies = table.find_elements_by_xpath('tr/td[1]/i/a')
        for movie in movies:
            print(movie.text)
        driver.back()
years = driver.find_elements_by_partial_link_text('List of Bollywood films of')
del years[:2]
MoviesList(years, driver)

I am getting the list of years from this page and storing it in the years variable. Then I loop through all the years and try to extract the top 10 movies of each year. See this for reference.

Output:

Tanhaji
Baaghi 3
...
...
Panga
# Top movies of the year 2020
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document (from line year.click())

Expected Output:

Tanhaji  
...
...
War  # First movie of the year 2019
Saaho
...
...
Vikram Urvashi  # Last movie of the year 1920
# Top movies of the year from 2020 to 1920

I have already referred to this and this question, but to no avail. I have tried an explicit wait too, but it didn't work.

I am aware of why the error occurs, but I don't know how to handle it other than by adding an implicit or explicit wait.

What am I doing wrong? How can I improve this code to get the desired output?

Any help would be much appreciated.


Answer 1:


To collect the data from the Wikipedia Lists of Bollywood films pages using Selenium and Python, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategies:

Note: As a demonstration, this program is restricted to collecting the movies from the Highest worldwide gross section for the previous three (3) years only.

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://en.wikipedia.org/wiki/Lists_of_Bollywood_films")
    parent_window  = driver.current_window_handle
    years = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.PARTIAL_LINK_TEXT, "List of Bollywood films of")))[2:5]]
    print(years)
    for year in years:
        driver.execute_script("window.open('" + year +"')")
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        windows_after = driver.window_handles
        new_window = [x for x in windows_after if x != parent_window][0]
        driver.switch_to.window(new_window)
        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table/caption//following::tbody[1]//td/i/a")))])
        driver.close()
        driver.switch_to.window(parent_window)
    driver.quit()
    
  • Console Output:

    ['Tanhaji', 'Baaghi 3', 'Street Dancer 3D', 'Shubh Mangal Zyada Saavdhan', 'Malang', 'Chhapaak', 'Love Aaj Kal', 'Jawaani Jaaneman', 'Thappad', 'Panga']
    ['War', 'Saaho', 'Kabir Singh', 'Uri: The Surgical Strike', 'Bharat', 'Good Newwz', 'Mission Mangal', 'Housefull 4', 'Gully Boy', 'Dabangg 3']
    ['Sanju', 'Padmaavat', 'Andhadhun', 'Simmba', 'Thugs of Hindostan', 'Race 3', 'Baaghi 2', 'Hichki', 'Badhaai Ho', 'Pad Man']
    

References

You can find a couple of relevant detailed discussions in:

  • How to open multiple hrefs within a webtable to scrape through selenium
  • WebScraping JavaScript-Rendered Content using Selenium in Python
  • Unable to access the remaining elements by xpaths in a loop after accessing the first element- Webscraping Selenium Python
  • How to open each product within a website in a new tab for scraping using Selenium through Python



Answer 2:


def MoviesList(linktext, driver):
    count = 0
    while True:
        # Re-find the year links on every iteration: driver.back() reloads
        # the page, so references collected earlier become stale.
        years = driver.find_elements_by_partial_link_text(linktext)
        del years[:2]
        if count >= len(years):
            break
        year = years[count]
        count += 1
        driver.implicitly_wait(150)
        year.click()
        table = driver.find_element_by_xpath('/html/body/div[3]/div[3]/div[5]/div[1]/table[2]/tbody')
        movies = table.find_elements_by_xpath('tr/td[1]/i/a')
        for movie in movies:
            print(movie.text)
        driver.back()


MoviesList('List of Bollywood films of', driver)

You should always find years again: clicking a year modifies the DOM, and whenever the DOM (the page HTML) is modified you have to find all the elements again, because the previous references are lost. This is why you get a stale element exception.
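The re-find pattern above can be illustrated outside Selenium: instead of holding on to objects from a collection that gets rebuilt on every navigation, keep only an index and look the item up again on each iteration. This is a minimal, hypothetical sketch of that idea (the rebuild_page helper simulates the DOM being re-created; it is not Selenium code):

```python
def rebuild_page():
    # Simulates the DOM being re-created after navigation: every call
    # returns brand-new link objects, invalidating any old references.
    return [{"text": f"List of Bollywood films of {y}"} for y in (2020, 2019, 2018)]

collected = []
count = 0
while True:
    links = rebuild_page()          # re-find the "elements" after each navigation
    if count >= len(links):
        break
    collected.append(links[count]["text"])  # the index survives the rebuild
    count += 1

print(collected)
```

Holding on to the objects returned by an earlier rebuild_page() call would be the equivalent of the stale element references in the question; the integer index is what carries state across page loads.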



Source: https://stackoverflow.com/questions/65623799/staleelementreferenceexception-even-after-adding-the-wait-while-collecting-the-d
