Using InitSpider with splash: only parsing the login page?


旧时难觅i 2021-01-02 02:32

This is sort of a follow-up question to one I asked earlier.

I'm trying to scrape a webpage that I have to log in to reach first. But after authentication, the web

3 answers

死守一世寂寞 (original poster)

2021-01-02 03:06

You can get all the data without JavaScript at all: the site provides links for browsers that have JavaScript disabled, and the URLs are the same apart from an added ?offset=0. You just need to parse the query parameters from the URL of the tourney you are interested in and create a FormRequest from them.

import scrapy
from scrapy.spiders.init import InitSpider
# Python 3 location; in Python 2 these came from the urlparse module.
from urllib.parse import parse_qs, urlparse


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]

    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        # Log in first; 'foo' and 'bar' are placeholder credentials.
        return [scrapy.FormRequest(self.login_page,
                                   formdata={'username': 'foo', 'password': 'bar'}, callback=self.parse)]

    def parse(self, response):
        # After login, request the no-JavaScript version of the index page.
        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php?offset=0", callback=self.get_all_tournaments)

    def get_all_tournaments(self, r):
        # Follow the first link that points at the tournament history.
        url = r.xpath("//a/@href[contains(., 'tourneyhistory')]").extract_first()
        yield scrapy.Request(url, callback=self.chosen_tourney)

    def chosen_tourney(self, r):
        # Pick the tourney of interest and resend its query parameters as form data.
        url = r.xpath("//a[contains(./text(),'Speedball')]/@href").extract_first()
        query = urlparse(url).query
        yield scrapy.FormRequest("http://webutil.bridgebase.com/v2/tarchive.php?offset=0", callback=self.get_tourney_data_links,
                                 formdata={k: v[0] for k, v in parse_qs(query).items()})

    def get_tourney_data_links(self, r):
        # Print every link found on the tourney archive page.
        print(r.xpath("//a/@href").extract())
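To see what the formdata dict in chosen_tourney looks like, the query-parsing step can be tried on its own. This is a minimal sketch: the helper name query_to_formdata and the example URL are made up for illustration, and the real link comes from the tournament history page.

```python
from urllib.parse import parse_qs, urlparse


def query_to_formdata(url):
    """Return the query string of `url` as a flat dict of str -> str."""
    query = urlparse(url).query
    # parse_qs maps each key to a list of values; keep only the first one.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical link, used only to show the shape of the result.
example = "http://example.com/tarchive.php?m=h&h=acbl&d=ACBL"
print(query_to_formdata(example))
```

Note that parse_qs drops parameters with blank values by default, so a key such as `x=` would not appear in the form data unless keep_blank_values=True is passed.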

The output contains numerous links. For hands you get links of the form tview.php?-t=....; you can request each one by joining it to http://webutil.bridgebase.com/v2/, and the response is a table of all the data that is easy to parse. The tables also contain links of the form tourney=4796-1455303720-&username=... associated with each hand. A snippet of the output from the tview link:

Title   #4796 Ind.  ACBL Fri 2pm
Host    ACBL
Tables  9

Section 1

I will leave the rest of the parsing to you.
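As a starting point for that, the relative tview.php links can be turned into full URLs before requesting them. The sketch below uses only the standard library; the helper name absolute_hand_links and the sample hrefs are hypothetical. Inside a Scrapy callback, response.urljoin(href) does the same job relative to the page that was fetched.

```python
from urllib.parse import urljoin

# Base URL mentioned in the answer above.
BASE = "http://webutil.bridgebase.com/v2/"


def absolute_hand_links(hrefs, base=BASE):
    """Keep only the tview.php links and resolve them against `base`."""
    return [urljoin(base, href) for href in hrefs if "tview.php" in href]


# Hypothetical hrefs, standing in for the list printed by the spider.
sample = ["tview.php?t=123", "index.php", "tview.php?t=456"]
for link in absolute_hand_links(sample):
    print(link)
```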

                                                        

              
                
Name     Score (IMPs)  Rank  Prize Points
colt22   42.88         1     0.90
francha  35.52         2     0.63
MSMK     34.38         3     0.45
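Rows like these could be turned into typed records once their text has been extracted. This is only a sketch, assuming each row is whitespace-separated text with the four columns shown in the header; the Result type and parse_result_line helper are hypothetical names, not part of the original answer.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    score_imps: float
    rank: int
    prize_points: float


def parse_result_line(line):
    """Split one results row into its four columns."""
    # Split from the right so a name containing spaces stays in one piece.
    name, score, rank, points = line.rsplit(None, 3)
    return Result(name, float(score), int(rank), float(points))


# Rows copied from the output snippet above.
rows = ["colt22 42.88 1 0.90", "francha 35.52 2 0.63", "MSMK 34.38 3 0.45"]
results = [parse_result_line(row) for row in rows]
for result in results:
    print(result)
```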