Question
I'm having trouble getting the spider to follow only the next page of ads rather than every link it finds, which eventually returns every craigslist page. I've played around with the rule, since I know that's where the problem lies, but I get either just the first page, every page on craigslist, or nothing. Any help?
Here's my current code:
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request


class PageSpider(CrawlSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (
        Rule(
            SgmlLinkExtractor(allow_domains=("medford.craigslist.org", )),
            callback='parse_page', follow=True
        ),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//div[@class="content"]/p[@class="row"]')

        for row in rows:
            item = CraigslistSampleItem()
            link = row.xpath('.//span[@class="pl"]/a')
            item['title'] = link.xpath("text()").extract()
            item['link'] = link.xpath("@href").extract()
            item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()

            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
Answer 1:
You should specify the allow argument of SgmlLinkExtractor:
allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
rules = (
    Rule(SgmlLinkExtractor(allow='http://medford.craigslist.org/cto/'),
         callback='parse_page', follow=True),
)
This restricts the extracted links to those under http://medford.craigslist.org/cto/, so the spider no longer wanders into the rest of the site.
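To get a feel for which links such a pattern lets through without running a crawl, the filter can be approximated with the standard re module. This is only a sketch: it assumes the link extractor tests each absolute URL with a regex search, which is my understanding of how allow behaves. The helper name is_allowed and the sample URLs are made up for illustration and are not taken from the real site.

```python
import re

# Pattern copied from the rule above. String patterns are treated as
# regular expressions, so the unescaped dots match any character.
ALLOW = re.compile('http://medford.craigslist.org/cto/')


def is_allowed(url, pattern=ALLOW):
    """Return True if the absolute URL matches the allow pattern.

    Hypothetical helper: it mimics the filter with a plain regex search.
    """
    return pattern.search(url) is not None


# Hypothetical URLs, for illustration only.
candidates = [
    "http://medford.craigslist.org/cto/index100.html",  # a later listing page
    "http://medford.craigslist.org/cto/1234567890.html",  # a single ad
    "http://medford.craigslist.org/about/help/",  # unrelated section
    "http://medford.craigslist.org/sss/",  # another category
]

kept = [url for url in candidates if is_allowed(url)]
for url in kept:
    print(url)
```

With these sample URLs only the two under /cto/ are kept. Note that a single-ad URL under /cto/ passes the filter as well, so such pages would also be handed to parse_page; a narrower pattern would be needed if only the pagination links should be followed.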
Hope that helps.
Source: https://stackoverflow.com/questions/22264141/recursive-scraping-craigslist-with-scrapy-and-python-2-7