Using InitSpider with splash: only parsing the login page?

旧时难觅i 2021-01-02 02:32

This is sort of a follow-up question to one I asked earlier.

I'm trying to scrape a webpage which I have to log in to reach first. But after authentication, the web

3 Answers
  •  失恋的感觉
    2021-01-02 03:02

    I don't think Splash alone would handle this particular case well.

    Here is a working approach:

    • use Selenium with the headless PhantomJS browser to log into the website
    • pass the browser cookies from PhantomJS into Scrapy

    The code:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    class BboSpider(scrapy.Spider):
        name = "bbo"
        allowed_domains = ["bridgebase.com"]
        login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"
    
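        def start_requests(self):
            # Log in with a real browser so the login page's JavaScript runs.
            # Note: webdriver.PhantomJS needs a Selenium release older than 4.
            driver = webdriver.PhantomJS()
            driver.get(self.login_page)
    
            driver.find_element(By.ID, "username").send_keys("user")
            driver.find_element(By.ID, "password").send_keys("password")
    
            driver.find_element(By.NAME, "submit").click()
    
            # Debugging aid: capture the page right after submitting the form.
            driver.save_screenshot("test.png")
            # Wait for a link that is only present once the login has succeeded.
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))
    
            # Hand the session cookies over to Scrapy and shut the browser down.
            cookies = driver.get_cookies()
            driver.quit()
    
            yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)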
    
        def parse(self, response):
            # response.body is bytes on Python 3; response.text is the decoded string.
            if "recent tournaments" in response.text:
                self.log("Login successful")
            else:
                self.log("Login failed")
            print(response.text)
    

    Logs "Login successful" and prints the HTML of the "hands" page.
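
If you would rather pass Scrapy only the cookie fields it documents for `Request(cookies=...)` (name, value, domain, path), a small helper along these lines could sit between `driver.get_cookies()` and the request. This is only a sketch: the name `selenium_cookies_to_scrapy` and the sample cookie are made up for illustration, and the filtering is optional tidying rather than something the spider needs in order to work.

```python
def selenium_cookies_to_scrapy(selenium_cookies):
    """Keep only the cookie keys documented for scrapy.Request(cookies=...)."""
    wanted = ("name", "value", "domain", "path")
    return [
        {key: cookie[key] for key in wanted if key in cookie}
        for cookie in selenium_cookies
    ]


# Hypothetical sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "PHPSESSID", "value": "abc123", "domain": "www.bridgebase.com",
     "path": "/", "httpOnly": False, "secure": False},
]
print(selenium_cookies_to_scrapy(sample))
```

In the spider you would then write `cookies = selenium_cookies_to_scrapy(driver.get_cookies())` before yielding the request.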
