Scrapy Unit Testing

前端未结

关注

 10  1537

逝去的感伤 2020-11-30 18:18

I\'d like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the \"scrapy crawl\" command I can run it through something

10条回答

余生分开走 (楼主)

2020-11-30 18:45

I use Betamax to run test on real site the first time and keep http responses locally so that next tests run super fast after:

Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

When you need to get latest version of site, just remove what betamax has recorded and re-run test.

Example:

from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase


with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        self.assertEqual({'image_href': u'image1.html'}, result.next())
        self.assertEqual({'image_href': u'image2.html'}, result.next())
        self.assertEqual({'image_href': u'image3.html'}, result.next())
        self.assertEqual({'image_href': u'image4.html'}, result.next())
        self.assertEqual({'image_href': u'image5.html'}, result.next())

        with self.assertRaises(StopIteration):
            result.next()

FYI, I discover betamax at pycon 2015 thanks to Ian Cordasco's talk.

0 讨论(0)

查看其它10个回答