Using InitSpider with Splash: only parsing the login page?

旧时难觅i 2021-01-02 02:32

This is sort of a follow-up question to one I asked earlier.

I'm trying to scrape a webpage which I have to log in to reach first. But after authentication, the web

3 answers
  • 2021-01-02 03:02

    I don't think Splash alone would handle this particular case well.

    Here is the working idea:

    • use Selenium and the PhantomJS headless browser to log into the website
    • pass the browser cookies from PhantomJS into Scrapy

    The code:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    class BboSpider(scrapy.Spider):
        name = "bbo"
        allowed_domains = ["bridgebase.com"]
        login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"
    
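        def start_requests(self):
            # PhantomJS support was removed in Selenium 4, so this needs Selenium 3.x.
            driver = webdriver.PhantomJS()
            driver.get(self.login_page)
    
            # Fill in and submit the login form.
            driver.find_element_by_id("username").send_keys("user")
            driver.find_element_by_id("password").send_keys("password")
    
            driver.find_element_by_name("submit").click()
    
            # Debugging aid: capture the page right after submitting the form.
            driver.save_screenshot("test.png")
            # Wait until a link that only exists after login shows up.
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))
    
            # Hand the authenticated session over to Scrapy, then end the browser session.
            cookies = driver.get_cookies()
            driver.quit()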
    
            yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)
    
        def parse(self, response):
            # response.body is bytes on Python 3, so compare against the decoded text.
            if "recent tournaments" in response.text:
                self.log("Login successful")
            else:
                self.log("Login failed")
            print(response.text)
    

    Prints Login successful and the HTML of the "hands" page.
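
    The cookie hand-off can also be done explicitly. Below is a minimal sketch, assuming that only the name, value, domain and path fields matter to Scrapy; the helper name selenium_cookies_to_scrapy and the sample cookie values are made up for illustration, and the list returned by driver.get_cookies() can usually be passed through unchanged, as in the spider above.

```python
def selenium_cookies_to_scrapy(cookies):
    """Keep a minimal set of fields from Selenium-style cookie dicts.

    Assumption: name, value, domain and path are enough for the target site.
    """
    wanted = ("name", "value", "domain", "path")
    return [{key: cookie[key] for key in wanted if key in cookie} for cookie in cookies]


# Sample input shaped like the output of driver.get_cookies().
# The cookie name and value are hypothetical.
browser_cookies = [
    {
        "name": "PHPSESSID",
        "value": "abc123",
        "domain": "www.bridgebase.com",
        "path": "/",
        "secure": False,
        "httpOnly": True,
        "expiry": 1700000000,
    },
]

scrapy_cookies = selenium_cookies_to_scrapy(browser_cookies)
print(scrapy_cookies)
```

    The result could then be passed as scrapy.Request(url, cookies=scrapy_cookies).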

  • Update

    So, it seems that start_requests fires before the login.

    Here is the code from InitSpider, minus comments.

    class InitSpider(Spider):
        def start_requests(self):
            self._postinit_reqs = super(InitSpider, self).start_requests()
            return iterate_spider_output(self.init_request())
    
        def initialized(self, response=None):
            return self.__dict__.pop('_postinit_reqs')
    
        def init_request(self):
            return self.initialized()
    

    InitSpider stashes the base class's start_requests in _postinit_reqs and only hands those requests back once initialized() is called.

    Your start_requests is a modified version of the base class's method. So maybe something like this will work.

    from scrapy.utils.spider import iterate_spider_output
    
    ...
    
    def start_requests(self):
        self._postinit_reqs = self.my_start_requests()
        return iterate_spider_output(self.init_request())
    
    def my_start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            }) 
    

    You need to return self.initialized() from your login callback; that is what releases the stashed requests.
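
    To make the ordering concrete, here is a dependency-free sketch of the same pattern. It does not use Scrapy: requests are modelled as plain tuples, and the class names, the "login" URL and the check_login callback are hypothetical stand-ins.

```python
class BaseSpider:
    start_urls = ["http://www.bridgebase.com/myhands/index.php"]

    def start_requests(self):
        # Stand-in for the normal start requests: one tuple per URL.
        for url in self.start_urls:
            yield ("GET", url)


class InitLikeSpider(BaseSpider):
    def start_requests(self):
        # Stash the normal start requests; nothing is sent yet.
        self._postinit_reqs = super().start_requests()
        # Emit only the init (login) request first.
        return iter([self.init_request()])

    def init_request(self):
        # Hypothetical login request carrying its callback.
        return ("POST", "login", self.check_login)

    def check_login(self, response):
        # Release the stashed requests only when the login worked.
        if "logged in" in response:
            return self.initialized()
        return []

    def initialized(self, response=None):
        return self.__dict__.pop("_postinit_reqs")


spider = InitLikeSpider()
first = list(spider.start_requests())
method, url, callback = first[0]
followups = list(callback("you are logged in"))
print(first[0][:2], followups)
```

    If check_login never returns self.initialized(), the stashed requests are never scheduled and the crawl stops after the login page.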

  • 2021-01-02 03:06

    You can get all the data without JavaScript at all. There are links available for browsers that have JavaScript disabled; the URLs are the same apart from an added ?offset=0. You just need to parse the query string from the tourney URL you are interested in and create a FormRequest.

    import scrapy
    from scrapy.spiders.init import InitSpider
    from urllib.parse import parse_qs, urlparse  # on Python 2 this was the urlparse module
    
    
    class BboSpider(InitSpider):
        name = "bbo"
        allowed_domains = ["bridgebase.com"]
        start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
        ]
    
        login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"
    
        def start_requests(self):
            # Log in with a plain form POST; no JavaScript is involved.
            return [scrapy.FormRequest(self.login_page,
                                       formdata={'username': 'foo', 'password': 'bar'}, callback=self.parse)]
    
        def parse(self, response):
            # ?offset=0 selects the version of the page meant for browsers without JavaScript.
            yield scrapy.Request("http://www.bridgebase.com/myhands/index.php?offset=0", callback=self.get_all_tournaments)
    
        def get_all_tournaments(self, r):
            url = r.xpath("//a/@href[contains(., 'tourneyhistory')]").extract_first()
            yield scrapy.Request(url, callback=self.chosen_tourney)
    
        def chosen_tourney(self, r):
            url = r.xpath("//a[contains(./text(),'Speedball')]/@href").extract_first()
            query = urlparse(url).query
            # Re-send the link's query parameters as form data.
            yield scrapy.FormRequest("http://webutil.bridgebase.com/v2/tarchive.php?offset=0", callback=self.get_tourney_data_links,
                                     formdata={k: v[0] for k, v in parse_qs(query).items()})
    
        def get_tourney_data_links(self, r):
            print(r.xpath("//a/@href").extract())
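
    The query-to-formdata step in chosen_tourney can be tried in isolation. This is a small sketch using only the standard library; the helper name query_to_formdata is hypothetical, and the example URL is borrowed from the output snippet further down purely to have something to parse.

```python
from urllib.parse import parse_qs, urlparse


def query_to_formdata(url):
    """Turn a URL's query string into a flat dict suitable for formdata.

    parse_qs returns a list per key; only the first value of each is kept.
    """
    query = urlparse(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


example_url = (
    "http://www.bridgebase.com/myhands/hands.php"
    "?tourney=4796-1455303720-&username=colt22"
)
formdata = query_to_formdata(example_url)
print(formdata)
```

    Note that parse_qs drops parameters with empty values unless keep_blank_values=True is passed.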
    

    There are numerous links in the output. For hands you get the tview.php?-t=.... links; you can request each one by joining it to http://webutil.bridgebase.com/v2/, and it will give you a table of all the data that is easy to parse. There are also links to tourney=4796-1455303720-&username=... associated with each hand in the tables. A snippet of the output from the tview link:

    class="bbo_tr_t">
        <table class="bbo_t_l">
        <tr><td class="bbo_tll" align="left">Title</td><td class="bbo_tlv">#4796 Ind.  ACBL Fri 2pm</td></tr>
        <tr><td class="bbo_tll" align="left">Host</td><td class="bbo_tlv">ACBL</td></tr>
        <tr><td class="bbo_tll" align="left">Tables</td><td class="bbo_tlv">9</td></tr>
    
    
    
        </table>
    
        </div><div class='sectionbreak'>Section 1 </div><div class='onesection'> <table class='sectiontable' ><tr><th>Name</th><th>Score (IMPs)</th><th class='rank'>Rank</th><th>Prize</th><th>Points</th></tr>
    <tr class='odd'><td>colt22</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=colt22" target="_blank">42.88</a></td><td class='rank' >1</td><td></td><td>0.90</td></tr>
    <tr class='even'><td>francha</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=francha" target="_blank">35.52</a></td><td class='rank' >2</td><td></td><td>0.63</td></tr>
    <tr class='odd'><td>MSMK</td><td><a href="http://www.bridgebase.com/myhands/hands.php?tourney=4796-1455303720-&username=MSMK" target="_blank">34.38</a></td><td class='rank' >3</td><td></td><td>0.45</td></tr>
    

    I will leave the rest of the parsing to you.
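
    As a starting point, here is one possible sketch of pulling the name, hands link, score and rank out of rows shaped like the snippet above. It uses only the standard library html.parser rather than Scrapy selectors; the class name ScoreRowParser is made up, and the column order (name, score, rank) is assumed from the snippet.

```python
from html.parser import HTMLParser


class ScoreRowParser(HTMLParser):
    """Collect (name, hands_url, score, rank) from table rows that contain a link."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None    # text of the <td> cells in the current row
        self._href = None     # last link seen inside the current row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")
        elif tag == "a" and self._in_td:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            # Header rows use <th> and have no link, so they are skipped here.
            if self._href and len(self._cells) >= 3:
                name, score, rank = self._cells[:3]
                self.rows.append((name, self._href, score, rank))
            self._cells = None

    def handle_data(self, data):
        if self._in_td and self._cells:
            self._cells[-1] += data.strip()


sample = (
    "<table class='sectiontable'>"
    "<tr><th>Name</th><th>Score (IMPs)</th><th class='rank'>Rank</th></tr>"
    "<tr class='odd'><td>colt22</td><td>"
    "<a href=\"http://www.bridgebase.com/myhands/hands.php"
    "?tourney=4796-1455303720-&username=colt22\" target=\"_blank\">42.88</a>"
    "</td><td class='rank' >1</td><td></td><td>0.90</td></tr>"
    "</table>"
)

parser = ScoreRowParser()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```

    Inside the spider itself, the same extraction would more naturally be written with response.xpath or response.css.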
