Can scrapy be used to scrape dynamic content from websites that are using AJAX?

星月不相逢 · 2020-11-21 17:48

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website…

8 Answers
  没有蜡笔的小新 · 2020-11-21 18:18

    Another solution would be to implement a download handler or download handler middleware (see the Scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:

    1) Define the class within the middlewares.py script.

    from selenium import webdriver
    from scrapy.http import HtmlResponse
    
    class JsDownload(object):

        @check_spider_middleware
        def process_request(self, request, spider):
            # render the page with PhantomJS so AJAX-loaded content is present
            driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')  # raw string for the Windows path
            driver.get(request.url)
            body = driver.page_source.encode('utf-8')
            driver.quit()
            # hand the rendered HTML back to Scrapy as the response
            return HtmlResponse(request.url, encoding='utf-8', body=body)
    

    2) Add the JsDownload class to the DOWNLOADER_MIDDLEWARES dict in settings.py:

    DOWNLOADER_MIDDLEWARES = {'MyProj.middlewares.JsDownload': 500}
    

    3) Integrate the HtmlResponse within your_spider.py. Decoding the response body gives you the rendered output.

    from scrapy.spiders import CrawlSpider
    from MyProj.items import CrawlerItem  # the item class defined in your project's items.py

    class Spider(CrawlSpider):
        # define unique name of spider
        name = "spider"

        start_urls = ["https://www.url.de"]

        def parse(self, response):
            # initialize items
            item = CrawlerItem()

            # store the rendered page body as item data
            item["js_enabled"] = response.body.decode("utf-8")
            yield item
    

    Optional Addon:
    I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

    import functools
    import logging  # scrapy.log is deprecated; use the stdlib logging levels

    def check_spider_middleware(method):
        @functools.wraps(method)
        def wrapper(self, request, spider):
            msg = '%%s %s middleware step' % (self.__class__.__name__,)
            if self.__class__ in spider.middleware:
                spider.log(msg % 'executing', level=logging.DEBUG)
                return method(self, request, spider)
            else:
                spider.log(msg % 'skipping', level=logging.DEBUG)
                return None

        return wrapper
    

    For the wrapper to work, all spiders must have at minimum:

    middleware = set([])
    

    To include a middleware (the set holds the middleware class itself), as in the sketch below:

    middleware = set([JsDownload])  # the middleware class imported from MyProj.middlewares
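
    Putting the pieces together, a minimal sketch of a spider that opts in to the middleware might look like the following (the MyProj.middlewares and MyProj.items module paths and the js_spider name are assumptions, carried over from the snippets above):

    from scrapy.spiders import CrawlSpider

    from MyProj.middlewares import JsDownload   # assumed module path for the middleware class
    from MyProj.items import CrawlerItem        # assumed module path for the item class

    class JsSpider(CrawlSpider):
        name = "js_spider"
        start_urls = ["https://www.url.de"]

        # opt in to the Selenium middleware; spiders without this entry
        # are skipped by the wrapper and use Scrapy's normal downloader
        middleware = set([JsDownload])

        def parse(self, response):
            # response.body already contains the JavaScript-rendered HTML,
            # fetched with a single request through the middleware
            item = CrawlerItem()
            item["js_enabled"] = response.body.decode("utf-8")
            yield item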
    

    Advantage:
    The main advantage to implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands the response off to the spider. The spider then makes a brand-new request in its parse_page function -- that's two requests for the same content.
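
    For contrast, a rough sketch of the spider-side pattern being compared (the TwoRequestSpider name and the inline Selenium call are illustrative assumptions, not A T's exact code): Scrapy's downloader fetches the page once, and then the callback fetches the same URL again with Selenium just to get the rendered HTML.

    from scrapy.spiders import Spider
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class TwoRequestSpider(Spider):
        name = "two_request_spider"
        start_urls = ["https://www.url.de"]

        def parse(self, response):
            # request 1: Scrapy's downloader has already fetched response.url
            # request 2: the same URL is fetched again through Selenium
            driver = webdriver.PhantomJS()
            driver.get(response.url)
            body = driver.page_source.encode("utf-8")
            driver.quit()
            rendered = HtmlResponse(response.url, encoding="utf-8", body=body)
            yield {"js_enabled": rendered.body.decode("utf-8")}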
