Scrapy: Parsing list items onto separate lines

死守一世寂寞 2020-12-18 15:29

I tried to adapt the answer to this question to my issue, but without success.

Here's some example html code:
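
(The original sample didn't survive; the following is a hypothetical reconstruction, inferred from the XPath expressions used in the answer below. All element text is made up.)

    <h1>Example Practice Name</h1>
    <div id="content" class="clear">
        <div class="content">
            <div>
                <dl class="clear">
                    <dt>Contact hours</dt>
                    <dd>Mon-Fri 8am-5pm</dd>
                </dl>
                <dl class="clear">
                    <dt>Phone</dt>
                    <dd>09 123 4567</dd>
                </dl>
            </div>
        </div>
    </div>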

1 Answer
  • 2020-12-18 16:00

    First, you are using results = hxs.select('//*[@id="content"]/div[1]'), so

        results = hxs.select('//*[@id="content"]/div[1]')
        for result in results:
            ...
    

    will loop over one div only: the first child div of <div id="content" class="clear">.

    What you need is to loop over every <dl class="clear">...</dl> within this //*[@id="content"]/div[1] (it would probably be easier to maintain with //*[@id="content"]/div[@class="content"]):

            results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
    

    Second, in each loop iteration, you are using absolute XPath expressions (//div...):

    result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')
    

    This will select all dd elements following a matching dt, starting from the document root node rather than from the current result.

    Look at the section on working with relative XPaths in the Scrapy selectors docs for details.

    You need to use relative XPath expressions, relative within each result scope representing each dl, such as dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text().
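
    To see the difference concretely, here is a minimal, self-contained sketch with made-up markup; Selector with .xpath()/.get() is the modern spelling of HtmlXPathSelector with .select()/.extract():

        from scrapy.selector import Selector

        # made-up markup mirroring the structure discussed above
        html = """
        <div id="content" class="clear"><div class="content"><div>
            <dl class="clear"><dt>Phone</dt><dd>09 123 4567</dd></dl>
            <dl class="clear"><dt>Region</dt><dd>Auckland</dd></dl>
        </div></div></div>
        """

        for result in Selector(text=html).xpath(
                '//*[@id="content"]/div[@class="content"]/div/dl'):
            # absolute: starts at the document root, so both iterations
            # print the first <dd> of the whole page ("09 123 4567")
            print(result.xpath('//dl/dt/following-sibling::dd[1]/text()').get())
            # relative: scoped to the current <dl>, prints its own <dd>
            print(result.xpath('dt/following-sibling::dd[1]/text()').get())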

    The "practice" field however can still use an absolute XPath expression //h1/text(), but you could also have a variable practice set once, and use it in each WebhealthItem1() instance

            ...
            practice = hxs.select('//h1/text()').extract()
            for result in results:
                item = WebhealthItem1()
                ...
                item['practice'] = practice
    

    Here's what your spider would look like with these changes:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from webhealth.items1 import WebhealthItem1
    
    class WebhealthSpider(BaseSpider):
    
        name = "webhealth_content1"
    
        download_delay = 5
    
        allowed_domains = ["webhealth.co.nz"]
        start_urls = [
            "http://auckland.webhealth.co.nz/provider/service/view/914136/"
            ]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
    
            # the practice name appears once per page, in the <h1>
            practice = hxs.select('//h1/text()').extract()
            items1 = []
    
            # one <dl class="clear"> per group of <dt>/<dd> field pairs
            results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
            for result in results:
                item = WebhealthItem1()
                #item['url'] = result.select('//dl/a/@href').extract()
                item['practice'] = practice
                # relative expressions (no leading //) are scoped to this <dl>
                item['hours'] = map(unicode.strip,
                    result.select('dt[contains(.," Contact hours")]/following-sibling::dd[1]/text()').extract())
                item['more_hours'] = map(unicode.strip,
                    result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
                item['physical_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
                item['postal_address'] = map(unicode.strip,
                    result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
                item['postcode'] = map(unicode.strip,
                    result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
                item['district_town'] = map(unicode.strip,
                    result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
                item['region'] = map(unicode.strip,
                    result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
                item['phone'] = map(unicode.strip,
                    result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
                item['website'] = map(unicode.strip,
                    result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
                item['email'] = map(unicode.strip,
                    result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
                items1.append(item)
            return items1
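
    As a side note, map(unicode.strip, ...) is Python 2 era code, and HtmlXPathSelector / .select() / .extract() have long since been replaced by Selector / .xpath() / .getall(). Assuming Python 3 and a current Scrapy release, one of the fields above would look roughly like this (a sketch, not a drop-in rewrite):

        # hypothetical Python 3 / modern-Scrapy equivalent of one field;
        # `result` would come from response.xpath(...) instead of hxs.select(...)
        item['phone'] = [s.strip() for s in
            result.xpath('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').getall()]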
    

    I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960
