My script parses all the links again and again from an infinite scrolling webpage

Submitted by 天涯浪子 on 2019-12-25 13:39:08

Question


I've written a script using Python in combination with Selenium to get all the company links from a webpage that doesn't display all of its links until it is scrolled to the very bottom. When I run my script, I do get the desired links, but lots of duplicates are scraped along with them. At this point I can't figure out how to modify my script so that it collects only the unique links. Here is what I've tried so far:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

while True:
    # Scroll to the bottom and give new items time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # Prints every link loaded so far, so earlier links repeat on each pass
    for items in driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]"):
        item = items.find_elements_by_xpath('.//a')[0]
        print(item.get_attribute("href"))

driver.close()

Answer 1:


I don't know Python, but I do know what you are doing wrong. Hopefully you'll be able to figure out the code for yourself ;)

Every time you scroll down, 50 links are added to the page until there are 1000 links. Well, almost: the page starts with 20 links, then adds 30, and then adds 50 on each scroll until there are 1000.

The way your code is written now, each pass prints:

The first 20 links.

The first 20 again, plus the next 30.

The first 50, plus the next 50.

And so on...

What you actually want to do is keep scrolling until all the links are on the page, and only then print them. Hope that helps.

Here's the updated Python code (I've checked it and it works):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    listElements = driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")
    print(len(listElements))
    # Keep scrolling until all 1000 links have loaded
    if len(listElements) == 1000:
        break

# Now that the list is complete, print each link exactly once
for item in listElements:
    print(item.get_attribute("href"))

driver.close()

If you want it to run a bit faster, you could swap out the time.sleep(5) for Anderson's wait statement (see Answer 2 below).
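
For instance, here's a minimal sketch of that swap, reusing the driver set up above (the names link_xpath and current_count are my own, and it assumes the page really does stop at 1000 links; otherwise the final wait would raise a TimeoutException):

from selenium.webdriver.support.ui import WebDriverWait

link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

while True:
    current_count = len(driver.find_elements_by_xpath(link_xpath))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Resume as soon as new links appear instead of always sleeping 5 seconds
    WebDriverWait(driver, 10).until(
        lambda d: len(d.find_elements_by_xpath(link_xpath)) > current_count)
    listElements = driver.find_elements_by_xpath(link_xpath)
    if len(listElements) == 1000:
        break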




Answer 2:


You can try the code below:

from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

# driver is the webdriver.Chrome() instance from the question's code
link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

my_links = []
while True:
    try:
        # Count the links currently in the DOM (my_links holds duplicates, so its length would overshoot)
        current_length = len(driver.find_elements_by_xpath(link_xpath))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # until() passes the driver to the callable, so the lambda must take one argument
        wait(driver, 10).until(lambda d: len(d.find_elements_by_xpath(link_xpath)) > current_length)
        my_links.extend([a.get_attribute("href") for a in driver.find_elements_by_xpath(link_xpath)])
    except TimeoutException:
        # No new links within 10 seconds: we've reached the bottom of the list
        break

my_links = set(my_links)

This keeps scrolling and collecting links for as long as new ones appear. Finally, set() leaves only the unique values.
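
One caveat: set() doesn't preserve the page order of the links. If the order matters, a small alternative sketch (assuming Python 3.7+, where dicts keep insertion order) is:

# De-duplicate while keeping the first-seen order of the links
my_links = list(dict.fromkeys(my_links))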



Source: https://stackoverflow.com/questions/44894099/my-script-parses-all-the-links-again-and-again-from-a-infinite-scrolling-webpage
