I'm writing a scrapy spider to crawl for today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com
, it successfully extracts a list of article urls with le.extract_links(response)
, but I can't get my crawl command (scrapy crawl nyt -o out.json
) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
from datetime import date
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ..items import NewsArticle
with open('urls/debug/nyt.txt') as debug_urls:
debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
release_urls = release_urls.readlines() # ["http://www.nytimes.com"]
today = date.today().strftime('%Y/%m/%d')
print today
class NytSpider(scrapy.Spider):
name = "nyt"
allowed_domains = ["nytimes.com"]
start_urls = release_urls
rules = (
Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
callback='parse', follow=True),
)
def parse(self, response):
article = NewsArticle()
for story in response.xpath('//article[@id="story"]'):
article['url'] = response.url
article['title'] = story.xpath(
'//h1[@id="story-heading"]/text()').extract()
article['author'] = story.xpath(
'//span[@class="byline-author"]/@data-byline-name'
).extract()
article['published'] = story.xpath(
'//time[@class="dateline"]/@datetime').extract()
article['content'] = story.xpath(
'//div[@id="story-body"]/p//text()').extract()
yield article
I have found the solution to my problem. I was doing 2 things wrong:
- I needed to subclass
CrawlSpider
rather thanSpider
if I wanted it to automatically crawl sublinks. - When using
CrawlSpider
, I needed to use a callback function rather than overridingparse
. As per the docs, overridingparse
breaksCrawlSpider
functionality.
来源:https://stackoverflow.com/questions/30922218/scrapy-spider-not-following-links