Using InitSpider with splash: only parsing the login page?


旧时难觅i 2021-01-02 02:32

This is sort of a follow-up question to one I asked earlier.

I'm trying to scrape a webpage which I have to log in to reach first. But after authentication, the web

3 Answers

失恋的感觉 (original poster)

2021-01-02 03:02

I don't think Splash alone would handle this particular case well.

Here is the working idea:

use Selenium with the PhantomJS headless browser to log in to the website
pass the browser cookies from PhantomJS into Scrapy

The code:

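import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        # Log in through a headless browser instead of a Scrapy form request.
        # Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
        driver = webdriver.PhantomJS()
        try:
            driver.get(self.login_page)

            # Fill in and submit the login form (placeholder credentials).
            driver.find_element(By.ID, "username").send_keys("user")
            driver.find_element(By.ID, "password").send_keys("password")
            driver.find_element(By.NAME, "submit").click()

            # Debugging aid: capture the page right after submitting the form.
            driver.save_screenshot("test.png")
            # Wait for a link that is only present once the login has gone through.
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

            cookies = driver.get_cookies()
        finally:
            # quit() shuts down the browser process even if the login step raised.
            driver.quit()

        # Hand the authenticated session cookies over to Scrapy.
        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        # response.body is bytes on Python 3; response.text is the decoded string.
        if "recent tournaments" in response.text:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.text)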

Logs "Login successful" and prints the HTML of the "hands" page.
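
Selenium's get_cookies() returns dicts that can carry more fields (for example expiry or httpOnly) than the name, value, domain and path keys shown in Scrapy's Request documentation. If you would rather pass Scrapy only those four, a small helper along these lines may be useful. This is a sketch, not part of the original answer: the helper name to_scrapy_cookies and the constant SCRAPY_COOKIE_KEYS are made up for illustration, and it assumes the four keys above are all the target site needs.

```python
# Hypothetical helper (not from the original answer): keep only the cookie
# fields shown in Scrapy's Request documentation and drop whatever else the
# browser reported (expiry, httpOnly, secure, ...).
SCRAPY_COOKIE_KEYS = ("name", "value", "domain", "path")


def to_scrapy_cookies(selenium_cookies):
    """Return a new list of cookie dicts limited to SCRAPY_COOKIE_KEYS."""
    return [
        {key: cookie[key] for key in SCRAPY_COOKIE_KEYS if key in cookie}
        for cookie in selenium_cookies
    ]


# Intended use inside start_requests (assumes `driver` is the logged-in browser):
#     cookies = to_scrapy_cookies(driver.get_cookies())
#     yield scrapy.Request(url, cookies=cookies)
```

The helper builds new dicts rather than mutating the ones Selenium returned, so the original cookie list stays available for debugging.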
