scrapy xpath selector repeats data

Posted by 徘徊边缘 on 2019-11-27 08:17:50

Question


I am trying to extract the business name and address from each listing and export them to a CSV file, but the output CSV is wrong. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problem.

yp_spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from yp.items import Biz

class MySpider(BaseSpider):
    name = "ypages"
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []

        for biz in bizs:
            item = Biz()
            item['name'] = biz.select("//h3/a/text()").extract()
            item['address'] = biz.select("//span[@class='street-address']/text()").extract()
            print item
            items.append(item)

        return items

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class Biz(Item):
    name = Field()
    address = Field()

    def __str__(self):
        return "Website: name=%s address=%s" %  (self.get('name'), self.get('address'))

The output of 'scrapy crawl ypages -o list.csv -t csv' is a long list of business names followed by a long list of addresses, and the same data is repeated several times.


Answer 1:


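You should prefix the inner XPath expressions with a "." to make them relative. Inside the loop, an expression that starts with "//" is still evaluated from the document root, so every item receives the names and addresses of all listings on the page. From the Scrapy documentation (http://doc.scrapy.org/en/0.16/topics/selectors.html):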

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.select('.//p'):  # extracts all <p> inside
...     print p.extract()
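
To see why the missing dot produces repeated data, here is a minimal sketch that uses only the standard library instead of Scrapy. The markup, the business names, and the two helper functions are made up for illustration. Note that xml.etree.ElementTree does not accept absolute paths on an element, so the whole-document search that "//h3/a" performs in Scrapy is emulated here by querying from the root node.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a results page with two listings.
PAGE = """
<html><body>
  <div class="listing_content">
    <h3><a>Alpha Cafe</a></h3>
    <span class="street-address">1 Main St</span>
  </div>
  <div class="listing_content">
    <h3><a>Beta Diner</a></h3>
    <span class="street-address">2 Oak Ave</span>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
listings = root.findall(".//div[@class='listing_content']")


def names_from_document(_listing):
    # Emulates biz.select("//h3/a/text()") in Scrapy: the search starts at
    # the document root, so the listing passed in is ignored.
    return [a.text for a in root.findall(".//h3/a")]


def names_from_listing(listing):
    # Emulates biz.select(".//h3/a/text()"): the search starts at this
    # listing and only sees its descendants.
    return [a.text for a in listing.findall(".//h3/a")]


wrong = [names_from_document(biz) for biz in listings]
right = [names_from_listing(biz) for biz in listings]

print(wrong)  # every listing carries every name on the page
print(right)  # each listing carries only its own name
```

In the spider from the question, the equivalent change would be to write biz.select(".//h3/a/text()") and biz.select(".//span[@class='street-address']/text()") inside the loop.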


Source: https://stackoverflow.com/questions/14392186/scrapy-xpath-selector-repeats-data
