Scrapy: Parsing list items onto separate lines

死守一世寂寞 2020-12-18 15:29

I tried to adapt the answer to this question to my issue, but without success.

Here's some example html code:
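
(The original sample didn't survive; the following is a hypothetical reconstruction, inferred from the XPath expressions used in the answer below. All element text is made up.)

    <h1>Example Practice Name</h1>
    <div id="content" class="clear">
        <div class="content">
            <div>
                <dl class="clear">
                    <dt>Contact hours</dt>
                    <dd>Mon-Fri 8am-5pm</dd>
                </dl>
                <dl class="clear">
                    <dt>Phone</dt>
                    <dd>09 123 4567</dd>
                </dl>
            </div>
        </div>
    </div>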

1 Answer
  • 2020-12-18 16:00

    First, you are using results = hxs.select('//*[@id="content"]/div[1]'), so

        results = hxs.select('//*[@id="content"]/div[1]')
        for result in results:
            ...
    

    will loop over one div only: the first child div of <div id="content" class="clear">.

    What you need is to loop over every <dl class="clear">...</dl> within this //*[@id="content"]/div[1] (it would probably be easier to maintain with //*[@id="content"]/div[@class="content"]):

            results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
    

    Second, in each loop iteration, you are using absolute XPath expressions (//div...):

    result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')
    

    This will select all dd elements following a matching dt, starting from the document root node rather than from the current result.

    Look at the section on working with relative XPaths in the Scrapy selectors docs for details.

    You need to use relative XPath expressions, relative within each result scope representing each dl, such as dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text().
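
    To see the difference concretely, here is a minimal, self-contained sketch with made-up markup; Selector with .xpath()/.get() is the modern spelling of HtmlXPathSelector with .select()/.extract():

        from scrapy.selector import Selector

        # made-up markup mirroring the structure discussed above
        html = """
        <div id="content" class="clear"><div class="content"><div>
            <dl class="clear"><dt>Phone</dt><dd>09 123 4567</dd></dl>
            <dl class="clear"><dt>Region</dt><dd>Auckland</dd></dl>
        </div></div></div>
        """

        for result in Selector(text=html).xpath(
                '//*[@id="content"]/div[@class="content"]/div/dl'):
            # absolute: starts at the document root, so both iterations
            # print the first <dd> of the whole page ("09 123 4567")
            print(result.xpath('//dl/dt/following-sibling::dd[1]/text()').get())
            # relative: scoped to the current <dl>, prints its own <dd>
            print(result.xpath('dt/following-sibling::dd[1]/text()').get())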

    The "practice" field however can still use an absolute XPath expression //h1/text(), but you could also have a variable practice set once, and use it in each WebhealthItem1() instance

            ...
            practice = hxs.select('//h1/text()').extract()
            for result in results:
                item = WebhealthItem1()
                ...
                item['practice'] = practice
    

    Here's what your spider would look like with these changes:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from webhealth.items1 import WebhealthItem1
    
    class WebhealthSpider(BaseSpider):
    
        name = "webhealth_content1"
    
        download_delay = 5
    
        allowed_domains = ["webhealth.co.nz"]
        start_urls = [
            "http://auckland.webhealth.co.nz/provider/service/view/914136/"
            ]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
    
            # the practice name appears once per page, in the <h1>
            practice = hxs.select('//h1/text()').extract()
            items1 = []
    
            # one <dl class="clear"> per group of <dt>/<dd> field pairs
            results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
            for result in results:
                item = WebhealthItem1()
                #item['url'] = result.select('//dl/a/@href').extract()
                item['practice'] = practice
                # relative expressions (no leading //) are scoped to this <dl>
                item['hours'] = map(unicode.strip,
                    result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
                item['more_hours'] = map(unicode.strip,
                    result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
                item['physical_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
                item['postal_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
                item['postcode'] = map(unicode.strip,
                    result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
                item['district_town'] = map(unicode.strip,
                    result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
                item['region'] = map(unicode.strip,
                    result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
                item['phone'] = map(unicode.strip,
                    result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
                item['website'] = map(unicode.strip,
                    result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
                item['email'] = map(unicode.strip,
                    result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
                items1.append(item)
            return items1
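
    As a side note, map(unicode.strip, ...) is Python 2 era code, and HtmlXPathSelector / .select() / .extract() have long since been replaced by Selector / .xpath() / .getall(). Assuming Python 3 and a current Scrapy release, one of the fields above would look roughly like this (a sketch, not a drop-in rewrite):

        # hypothetical Python 3 / modern-Scrapy equivalent of one field;
        # `result` would come from response.xpath(...) instead of hxs.select(...)
        item['phone'] = [s.strip() for s in
            result.xpath('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').getall()]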
    

    I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960
