Recursively Scraping Craigslist with Scrapy and Python 2.7


Question


I'm having trouble getting the spider to follow the next page of ads without it also following every other link it finds and eventually crawling every Craigslist page. I've played around with the rule, since I know that's where the problem lies, but I either get just the first page, every page on Craigslist, or nothing. Any help?

Here's my current code:

from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class PageSpider(CrawlSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (
        Rule(
        SgmlLinkExtractor(allow_domains=("medford.craigslist.org", )),
        callback='parse_page', follow=True
        ),

    )

    def parse_page(self, response):
        # Extract each ad row from the listing page
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//div[@class="content"]/p[@class="row"]')

        for row in rows:
            item = CraigslistSampleItem()
            link = row.xpath('.//span[@class="pl"]/a')
            item['title'] = link.xpath("text()").extract()
            item['link'] = link.xpath("@href").extract()
            item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()

            # Follow the ad's own page to grab the description, carrying
            # the partially-filled item along in the request meta
            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)


    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        # Complete the item carried over from parse_page and emit it
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

Answer 1:


You should specify the allow argument of SgmlLinkExtractor:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.

rules = (
    Rule(SgmlLinkExtractor(allow='http://medford.craigslist.org/cto/'), 
         callback='parse_page', follow=True),
)

This will restrict the spider to links under the http://medford.craigslist.org/cto/ URL.
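Note that allow is matched as a regular expression, so the unescaped dots above act as wildcards, and the pattern also matches the individual ad pages under /cto/, which parse_page already requests itself. If you want the rule to follow only the listing index pages, you could tighten the pattern. A minimal sketch, assuming Craigslist's pagination URLs at the time had the form /cto/index100.html, /cto/index200.html, and so on (that index\d+\.html pattern is an assumption, not something from the original question):

rules = (
    # Follow only the /cto/ landing page and its numbered index pages;
    # the index\d+\.html part is an assumed pagination URL format.
    Rule(SgmlLinkExtractor(allow=r'medford\.craigslist\.org/cto/(index\d+\.html)?$'),
         callback='parse_page', follow=True),
)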

Hope that helps.
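Note: SgmlLinkExtractor has since been removed from Scrapy along with the scrapy.contrib package. On a modern Scrapy install, the same rule would use scrapy.linkextractors.LinkExtractor instead; a sketch of the equivalent, untested against current Craigslist markup:

from scrapy.linkextractors import LinkExtractor

rules = (
    # Same restriction as above, with the modern link extractor
    Rule(LinkExtractor(allow=r'medford\.craigslist\.org/cto/'),
         callback='parse_page', follow=True),
)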



Source: https://stackoverflow.com/questions/22264141/recursive-scraping-craigslist-with-scrapy-and-python-2-7
