I\'d like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the \"scrapy crawl\" command I can run it through something
I use Betamax to run test on real site the first time and keep http responses locally so that next tests run super fast after:
Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.
When you need to get latest version of site, just remove what betamax has recorded and re-run test.
Example:
from scrapy import Spider, Request
from scrapy.http import HtmlResponse
class Example(Spider):
name = 'example'
url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'
def start_requests(self):
yield Request(self.url, self.parse)
def parse(self, response):
for href in response.xpath('//a/@href').extract():
yield {'image_href': href}
# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase
with Betamax.configure() as config:
# where betamax will store cassettes (http responses):
config.cassette_library_dir = 'cassettes'
config.preserve_exact_body_bytes = True
class TestExample(BetamaxTestCase): # superclass provides self.session
def test_parse(self):
example = Example()
# http response is recorded in a betamax cassette:
response = self.session.get(example.url)
# forge a scrapy response to test
scrapy_response = HtmlResponse(body=response.content, url=example.url)
result = example.parse(scrapy_response)
self.assertEqual({'image_href': u'image1.html'}, result.next())
self.assertEqual({'image_href': u'image2.html'}, result.next())
self.assertEqual({'image_href': u'image3.html'}, result.next())
self.assertEqual({'image_href': u'image4.html'}, result.next())
self.assertEqual({'image_href': u'image5.html'}, result.next())
with self.assertRaises(StopIteration):
result.next()
FYI, I discover betamax at pycon 2015 thanks to Ian Cordasco's talk.