crawl site that has infinite scrolling using python

Backend · Open · 4 answers · 1821 views
面向向阳花
面向向阳花 asked 2020-12-09 14:11

I have been doing research, and so far the Python package I plan to use is Scrapy. Now I am trying to figure out a good way to build a scraper for a site with infinite scrolling.

4 Answers
  • 2020-12-09 14:18

    You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

    Step 1: Install Selenium using pip

    pip install selenium 
    

    Step 2: Use the code below to automate the infinite scroll and extract the page source

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import time
    import unittest

    class Sel(unittest.TestCase):
        def setUp(self):
            self.driver = webdriver.Firefox()
            self.driver.implicitly_wait(30)
            self.base_url = "https://twitter.com"

        def test_sel(self):
            driver = self.driver
            driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
            driver.find_element(By.LINK_TEXT, "All").click()
            # Scroll to the bottom repeatedly so the page keeps loading more posts
            for i in range(100):
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(4)
            html_source = driver.page_source
            data = html_source.encode('utf-8')

    if __name__ == "__main__":
        unittest.main()
    

    The for loop scrolls the page repeatedly; each scroll triggers another batch of posts to load, so by the end the page source contains all the loaded data for you to extract.

    Step 3: Parse and print the data as required.

  • 2020-12-09 14:25
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    driver.get("http://www.something.com")
    # Find the bottom-most element with the given id and scroll it into view
    lastElement = driver.find_elements(By.ID, "someId")[-1]
    lastElement.send_keys(Keys.NULL)
    

    This will open a page, find the bottom-most element with the given id, and then scroll that element into view. You'll have to keep querying the driver to get the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to driver.find_elements, because I don't know of a way to query only the last matching element on the page.

    Through experimentation you might find there is an upper limit to the number of elements the page loads dynamically; it would be best to scroll until that number is reached and only then make the call to driver.find_elements.
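    One way to implement that cap is to stop scrolling once the element count stops growing. Here is a minimal sketch, where `scroll` and `count_elements` are stand-ins for the real `driver.execute_script(...)` and `len(driver.find_elements(...))` calls, and the stub page at the bottom is purely illustrative:

```python
def scroll_until_stable(scroll, count_elements, max_rounds=100, patience=2):
    """Scroll repeatedly, stopping when the number of loaded elements
    has not grown for `patience` consecutive rounds."""
    last_count = count_elements()
    stalls = 0
    for _ in range(max_rounds):
        scroll()
        current = count_elements()
        if current == last_count:
            stalls += 1
            if stalls >= patience:
                break  # the page has stopped loading new elements
        else:
            stalls = 0
            last_count = current
    return last_count

# Stub simulating a page that loads 5 elements per scroll, capped at 23:
state = {"n": 0}
def fake_scroll():
    state["n"] = min(state["n"] + 5, 23)

total = scroll_until_stable(fake_scroll, lambda: state["n"])
print(total)  # 23
```

    Keeping the stopping logic separate from Selenium makes it easy to test; in a real scraper you would pass lambdas wrapping the driver calls.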

  • 2020-12-09 14:31

    This is short, simple code that works for me:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("https://example.com/feed")  # replace with the target page

    SCROLL_PAUSE_TIME = 20

    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for the page to load more content
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    posts = driver.find_elements(By.CLASS_NAME, "post-text")

    for block in posts:
        print(block.text)
    
  • 2020-12-09 14:44

    For infinite scrolling, the data is loaded through Ajax calls. To find the underlying request:

    1. Open the browser's developer tools and switch to the Network tab.
    2. Clear the previous request history (the stop-like icon).
    3. Scroll the webpage; a new request triggered by the scroll event will appear.
    4. Open that request's headers and find the request URL.
    5. Copy and paste the URL into a separate tab to see the raw result of the Ajax call.
    6. Keep requesting that URL, page by page, until you reach the end of the data.
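    The final step can be sketched as a simple pagination loop. The `page` query parameter, the JSON body, and the empty-page stopping condition below are assumptions you would replace with whatever the network tab actually shows:

```python
import json
import urllib.request

def default_fetch(url):
    """Fetch a URL and decode the JSON body (stdlib only)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def fetch_all_pages(base_url, fetch=default_fetch, max_pages=50):
    """Request successive pages of a hypothetical paginated Ajax
    endpoint until an empty page signals the end of the data."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(f"{base_url}?page={page}")  # assumed query parameter
        if not batch:  # an empty page means there is no more data
            break
        items.extend(batch)
    return items

# Stubbed fetcher standing in for the real endpoint found in the network tab:
fake_pages = {1: ["post-1", "post-2"], 2: ["post-3"], 3: []}
result = fetch_all_pages(
    "https://example.com/api/posts",
    fetch=lambda url: fake_pages[int(url.split("page=")[1])],
)
print(result)  # ['post-1', 'post-2', 'post-3']
```

    Scraping the Ajax endpoint directly like this is usually much faster than driving a real browser, since no rendering or scrolling is involved.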
