Can Scrapy be replaced by pyspider?

旧巷少年郎 2020-12-24 07:03

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which …

2 Answers
  • 2020-12-24 07:42

pyspider and Scrapy have the same purpose, web scraping, but take different views of how to do it.

    • A spider should never stop until the WWW is dead. (Information changes and data is updated on websites, so a spider should have the ability and the responsibility to scrape the latest data. That's why pyspider has a URL database, a powerful scheduler, @every, age, etc.)

    • pyspider is a service more than a framework. (Components run in isolated processes; the lite all-in-one version runs as a service too; you don't need a Python environment, just a browser; everything about fetching and scheduling is controlled by the script via the API, not by startup parameters or global configs; resources/projects are managed by pyspider; etc.)

    • pyspider is a spider system. (Any component can be replaced, even reimplemented in C/C++/Java or any other language, for better performance or larger capacity.)

    and

    • on_start vs start_urls
    • token-bucket traffic control vs download_delay
    • returning JSON vs class Item
    • message queue vs Pipeline
    • built-in URL database vs set
    • persistence vs in-memory
    • PyQuery + any third-party package you like vs built-in CSS/XPath support
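    The token-bucket point above is worth a closer look, since it is the main difference in rate limiting: Scrapy waits a fixed download_delay between requests, while a token bucket allows short bursts and enforces an average rate. A minimal, framework-independent sketch of the idea (not pyspider's actual implementation):

    ```python
    import time

    class TokenBucket:
        """Minimal token-bucket rate limiter: tokens refill at a fixed rate,
        each request consumes one, and short bursts up to `capacity` are allowed."""

        def __init__(self, rate, capacity):
            self.rate = float(rate)          # tokens added per second
            self.capacity = float(capacity)  # maximum burst size
            self.tokens = float(capacity)    # start with a full bucket
            self.last = time.monotonic()

        def consume(self, n=1):
            """Return True if n tokens were available, i.e. the request may proceed."""
            now = time.monotonic()
            # Refill according to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False

    bucket = TokenBucket(rate=2, capacity=2)  # ~2 requests/second, burst of 2
    print(bucket.consume())  # True  (burst capacity available)
    print(bucket.consume())  # True
    print(bucket.consume())  # False (bucket drained; must wait for refill)
    ```

    With download_delay the third request would simply be scheduled later; with a token bucket it is refused now and retried once tokens refill, which is what lets a scheduler shape traffic per project.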

    In fact, I did not refer to Scrapy much. pyspider is really different from Scrapy.

    But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing anything.

  • 2020-12-24 07:51

    Since I use both Scrapy and pyspider, I would like to suggest the following:

    If the website is really small/simple, try pyspider first, since it has almost everything you need:

    • Use the web UI to set up the project
    • Try the online code editor and view parse results instantly
    • View the results easily in the browser
    • Run/pause the project
    • Set the expiration date so it can re-process the URL
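    A minimal sketch of what such a project looks like, assuming pyspider is installed and using a placeholder URL; the @every and age settings are what drive the scheduling and expiration mentioned above:

    ```python
    # Sketch of a pyspider handler (placeholder URL; field names are illustrative).
    from pyspider.libs.base_handler import BaseHandler, every, config

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)  # re-run on_start once a day
        def on_start(self):
            self.crawl('http://example.com/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)  # pages younger than 10 days are not re-fetched
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            # pyspider returns a plain dict instead of a Scrapy Item
            return {'url': response.url, 'title': response.doc('title').text()}
    ```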

    However, if you tried pyspider and found it can't fit your needs, it's time to use Scrapy:

    • migrate on_start to start_requests
    • migrate index_page to parse
    • migrate detail_page to detail_page
    • change self.crawl to response.follow
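    The migration steps above can be sketched as a Scrapy spider, assuming Scrapy is installed; the URL and callback names are illustrative:

    ```python
    # Hypothetical Scrapy equivalent of a simple pyspider handler.
    import scrapy

    class MigratedSpider(scrapy.Spider):
        name = 'migrated'

        # pyspider's on_start becomes Scrapy's start_requests
        def start_requests(self):
            yield scrapy.Request('http://example.com/', callback=self.parse)

        # pyspider's index_page becomes parse; self.crawl becomes response.follow
        def parse(self, response):
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.detail_page)

        # detail_page can keep its name as a plain callback yielding a dict
        def detail_page(self, response):
            yield {'url': response.url, 'title': response.css('title::text').get()}
    ```

    Run it with `scrapy runspider` or inside a Scrapy project; from there, items, pipelines, and middleware slot in incrementally.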

    Then you are almost done. Now you can play with Scrapy's advanced features like middleware, items, pipelines, etc.
