Building a RESTful Flask API for Scrapy

Asked by 余生分开走, 2020-12-09 13:35

The API should accept arbitrary HTTP GET requests containing URLs the user wants scraped, and Flask should return the results of the scrape.

The following code w…

2 Answers
  •  时光取名叫无心
    2020-12-09 14:08

    I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started and stopped more than once in a single thread.

    Let's assume there were no problem with the Twisted reactor and you could start and stop it at will. That wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads or processes anyway.
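    To make the problem concrete, here is a minimal sketch of the naive Flask + Scrapy pattern being warned against; the spider, route, and settings are hypothetical, not taken from the question:

        from flask import Flask, jsonify, request

        import scrapy
        from scrapy.crawler import CrawlerProcess

        app = Flask(__name__)

        class PageSpider(scrapy.Spider):
            name = "page"

            def parse(self, response):
                yield {"url": response.url, "title": response.css("title::text").get()}

        @app.route("/scrape")
        def scrape():
            url = request.args["url"]
            process = CrawlerProcess(settings={"LOG_ENABLED": False})
            process.crawl(PageSpider, start_urls=[url])
            # start() blocks this worker until the crawl finishes, and it starts
            # the Twisted reactor: the second request to hit this view dies with
            # ReactorNotRestartable, because a reactor can only run once per process.
            process.start()
            return jsonify(status="done")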

    I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.

    Scrapy is based on Twisted, so using twisted.web or Klein (https://github.com/twisted/klein) can be more straightforward. Tornado is also not hard, because you can make it use the Twisted event loop.
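    As an illustration, here is a hedged sketch of such an endpoint using Klein; the spider, route, and settings are my assumptions, not part of the question. CrawlerRunner.crawl() returns a Deferred, so the server keeps serving other requests while a spider is running:

        import json

        import scrapy
        from klein import Klein
        from scrapy import signals
        from scrapy.crawler import CrawlerRunner

        app = Klein()

        class PageSpider(scrapy.Spider):
            name = "page"

            def parse(self, response):
                yield {"url": response.url, "title": response.css("title::text").get()}

        @app.route("/scrape")
        def scrape(request):
            url = request.args[b"url"][0].decode()
            items = []

            runner = CrawlerRunner(settings={"LOG_ENABLED": False})
            crawler = runner.create_crawler(PageSpider)
            # Collect each scraped item as the spider emits it.
            crawler.signals.connect(
                lambda item, response, spider: items.append(dict(item)),
                signal=signals.item_scraped,
            )

            request.setHeader(b"Content-Type", b"application/json")
            d = runner.crawl(crawler, start_urls=[url])
            # The Deferred fires when the crawl finishes; Klein then writes
            # the callback's return value as the HTTP response body.
            d.addCallback(lambda _: json.dumps(items))
            return d

        app.run("localhost", 8080)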

    There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
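    Assuming you start it with the scrapyrt command inside a Scrapy project, it listens on port 9080 by default and exposes a /crawl.json endpoint. A client call might look like this (the spider name "page" is hypothetical):

        import requests

        # Query a running ScrapyRT instance; it runs the named spider on the
        # given URL and returns the scraped items as JSON.
        resp = requests.get(
            "http://localhost:9080/crawl.json",
            params={"spider_name": "page", "url": "http://example.com"},
        )
        print(resp.json()["items"])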

    As an example of Scrapy-Tornado integration, check Arachnado: it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.

    If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go that way, you might as well use requests + BeautifulSoup.
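    One hedged way to do the separate-process variant is to give every request a fresh child process, so each crawl gets its own Twisted reactor; all names below are illustrative:

        import multiprocessing

        from flask import Flask, jsonify, request

        import scrapy
        from scrapy import signals
        from scrapy.crawler import CrawlerProcess

        app = Flask(__name__)

        class PageSpider(scrapy.Spider):
            name = "page"

            def parse(self, response):
                yield {"url": response.url, "title": response.css("title::text").get()}

        def run_crawl(url, queue):
            # Runs in a child process, so it gets a brand-new Twisted reactor
            # every time and never hits ReactorNotRestartable.
            items = []
            process = CrawlerProcess(settings={"LOG_ENABLED": False})
            crawler = process.create_crawler(PageSpider)
            crawler.signals.connect(
                lambda item, response, spider: items.append(dict(item)),
                signal=signals.item_scraped,
            )
            process.crawl(crawler, start_urls=[url])
            process.start()  # blocks only the child process
            queue.put(items)

        @app.route("/scrape")
        def scrape():
            url = request.args["url"]
            queue = multiprocessing.Queue()
            proc = multiprocessing.Process(target=run_crawl, args=(url, queue))
            proc.start()
            items = queue.get()  # this Flask worker still blocks until the crawl ends
            proc.join()
            return jsonify(items)

    Note that queue.get() still blocks the Flask worker for the whole crawl, which is exactly the inefficiency described above; with concurrent traffic you would need a pool of workers, or hand the job to Celery and let clients poll for the result.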
