Does using scrapy-splash significantly affect scraping speed? [closed]

馋奶兔 submitted on 2019-12-09 16:23:54

Question


So far, I have been using just Scrapy and writing custom classes to deal with websites that use AJAX.

But if I were to use scrapy-splash, which as I understand it scrapes the rendered HTML after the JavaScript has run, will the speed of my crawler be affected significantly?

How does the time it takes to scrape a vanilla HTML page with Scrapy compare with scraping JavaScript-rendered HTML with scrapy-splash?

And lastly, how do scrapy-splash and Selenium compare?


Answer 1:


It depends on the amount of JavaScript present on the page.

Splash needs time to run the page's JavaScript. With the default wait of 0 it returns the HTML as soon as the page has loaded, without waiting for scripts that are still running or scheduled, so the response can be missing content that is rendered late.

  • You can set an explicit wait so that rendering has time to finish.
  • Setting some wait is good practice in general, at the cost of extra time per page.
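To get a feel for how much an explicit wait can slow a crawl, here is a rough back-of-the-envelope sketch. It assumes every page pays the same fixed wait and that a fixed number of pages render in parallel; the function name and its parameters are hypothetical and not part of Scrapy or Splash, and download, page load, and parsing time are ignored, so a real crawl will take longer.

```python
import math


def min_crawl_seconds(pages: int, wait: float, concurrency: int) -> float:
    """Lower bound on crawl time when every page pays a fixed render wait.

    Assumption: `concurrency` pages render in parallel, and nothing else
    (download, page load, parsing) costs any time.
    """
    if pages < 0 or wait < 0 or concurrency < 1:
        raise ValueError("pages and wait must be >= 0, concurrency >= 1")
    batches = math.ceil(pages / concurrency)  # rounds of parallel renders
    return batches * wait


# 1000 pages, 5 s wait each, 8 renders at a time: 125 rounds of 5 s
print(min_crawl_seconds(1000, 5.0, 8))  # 625.0
```

Under these assumptions the wait alone adds more than ten minutes to a 1000-page crawl, which is why it is worth keeping the wait as small as the target pages allow.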

For example, either of the following requests sets an explicit wait:

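import scrapy

# Inside a spider method: route a plain Request through Splash via its meta.
# 'wait' is in seconds and is passed as a number rather than a string.
yield scrapy.Request(url, callback=self.parse, meta={
    'splash': {'endpoint': 'render.html', 'args': {'wait': 25}},
})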

or

from scrapy_splash import SplashRequest

# Inside a spider method: SplashRequest fills in the Splash meta for you.
yield SplashRequest(url, self.parse, endpoint='render.html',
                    args={'wait': 5, 'html': 1})
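Underneath, a request like these ends up as a call to Splash's render.html HTTP endpoint, with the target URL and the wait passed as arguments. The sketch below builds the equivalent GET URL by hand using only the standard library. The helper name is hypothetical, and the base address assumes a Splash instance listening locally on port 8050; adjust it for your own setup.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base: str, target_url: str, wait: float = 0.0) -> str:
    """Build a GET URL for Splash's render.html endpoint.

    `url` and `wait` are render.html arguments; `wait` is in seconds.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


# Assumed local Splash instance; the target URL is a placeholder.
print(splash_render_url("http://localhost:8050", "https://example.com/page", wait=5))
```

Fetching that URL with any HTTP client and timing it against a direct fetch of the same page is a simple way to measure the rendering overhead for a specific site.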

Scrapy vs. Selenium

Selenium only automates web browser interaction, whereas Scrapy is a complete web crawling framework: it downloads HTML, processes the data, and saves it.

For scraping I would recommend Scrapy. If the problem is JavaScript, there are two options:

  • Scrapy already has an official project for JavaScript rendering, called scrapy-splash.
  • Alternatively, you can create a new Selenium webdriver instance inside the Scrapy spider, do the work, extract the data, and close the driver once all the work is done.
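The create, use, then close pattern from the second option can be sketched as a small context manager. This is only an illustration: `managed_driver` and `rendered_html` are hypothetical helper names, and the driver is passed in as a factory so that the sketch does not depend on a particular browser. The only driver members it relies on are `get`, `page_source`, and `quit`, which Selenium's Python WebDriver provides.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # closes the browser even if extraction raised


def rendered_html(driver, url):
    """Load `url` in the browser and return the page source it reports."""
    driver.get(url)
    return driver.page_source


# With Selenium installed, this could be used inside a spider roughly as
# follows (not executed here):
#   from selenium import webdriver
#   with managed_driver(webdriver.Firefox) as driver:
#       html = rendered_html(driver, "https://example.com")
```

Starting a browser is expensive, so reusing one driver for many pages and quitting it at the end is usually faster than creating a new driver for every request.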


Source: https://stackoverflow.com/questions/49891688/does-using-scrapy-splash-significantly-affect-scraping-speed
