Scrapy Unit Testing

前端 未结 10 1537
逝去的感伤
逝去的感伤 2020-11-30 18:18

I\'d like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the \"scrapy crawl\" command I can run it through something

10条回答
  •  余生分开走
    2020-11-30 18:45

    I use Betamax to run test on real site the first time and keep http responses locally so that next tests run super fast after:

    Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.

    When you need to get latest version of site, just remove what betamax has recorded and re-run test.

    Example:

    from scrapy import Spider, Request
    from scrapy.http import HtmlResponse
    
    
    class Example(Spider):
        name = 'example'
    
        url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'
    
        def start_requests(self):
            yield Request(self.url, self.parse)
    
        def parse(self, response):
            for href in response.xpath('//a/@href').extract():
                yield {'image_href': href}
    
    
    # Test part
    from betamax import Betamax
    from betamax.fixtures.unittest import BetamaxTestCase
    
    
    with Betamax.configure() as config:
        # where betamax will store cassettes (http responses):
        config.cassette_library_dir = 'cassettes'
        config.preserve_exact_body_bytes = True
    
    
    class TestExample(BetamaxTestCase):  # superclass provides self.session
    
        def test_parse(self):
            example = Example()
    
            # http response is recorded in a betamax cassette:
            response = self.session.get(example.url)
    
            # forge a scrapy response to test
            scrapy_response = HtmlResponse(body=response.content, url=example.url)
    
            result = example.parse(scrapy_response)
    
            self.assertEqual({'image_href': u'image1.html'}, result.next())
            self.assertEqual({'image_href': u'image2.html'}, result.next())
            self.assertEqual({'image_href': u'image3.html'}, result.next())
            self.assertEqual({'image_href': u'image4.html'}, result.next())
            self.assertEqual({'image_href': u'image5.html'}, result.next())
    
            with self.assertRaises(StopIteration):
                result.next()
    

    FYI, I discover betamax at pycon 2015 thanks to Ian Cordasco's talk.

提交回复
热议问题