scrapy-splash

Does using scrapy-splash significantly affect scraping speed? [closed]

馋奶兔 submitted on 2019-12-09 16:23:54
Question: So far, I have been using just Scrapy and writing custom classes to deal with websites that use AJAX. But if I were to use scrapy-splash, which, from what I understand, scrapes the rendered HTML after JavaScript has run, will the speed of my crawler be affected significantly? What would be the

Read cookies from Splash request

元气小坏坏 submitted on 2019-12-07 17:17:08
Question: I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request.

    script = """
    function main(splash)
      splash:init_cookies(splash.args.cookies)
      assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
      })
      assert(splash:wait(0.5))
      local entries = splash:history()
      local last_response = entries[#entries].response
      return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last
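One way to get at the cookies, sketched under assumptions: Splash's scripting API provides `splash:get_cookies()`, so the Lua script can return the cookie list next to the other fields, and the caller can read it from the decoded JSON result. The simplified script, the helper name `cookies_by_name`, and the sample response below are illustrative and not taken from the question.

```python
import json

# Lua script for Splash's "execute" endpoint. splash:init_cookies and
# splash:get_cookies are part of the Splash scripting API.
LUA_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  return {
    url = splash:url(),
    cookies = splash:get_cookies(),
  }
end
"""


def cookies_by_name(result):
    """Map cookie name -> value from the table returned by the Lua script.

    `result` is the decoded JSON body of the Splash response. Each cookie
    is assumed to be a HAR-style dict with at least "name" and "value".
    """
    return {c["name"]: c["value"] for c in result.get("cookies", [])}


# Hypothetical response body, for illustration only.
sample = json.loads(
    '{"url": "http://example.com/", "cookies": ['
    '{"name": "sessionid", "value": "abc123",'
    ' "domain": "example.com", "path": "/"}]}'
)
print(cookies_by_name(sample))
```

With scrapy-splash, a request sent to the `execute` endpoint with this script as `lua_source` should expose the returned table as `response.data`, so `cookies_by_name(response.data)` would be the likely call inside a callback.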

Scrapy Splash won't execute lua script

半腔热情 submitted on 2019-12-07 08:00:52
Question: I have run across an issue in which my Lua script refuses to execute. The response returned from the ScrapyRequest call seems to be an HTML body, while I'm expecting a document title. I am assuming that the Lua script is never called, as it seems to have no apparent effect on the response. I have dug through the documentation a lot and can't quite figure out what is missing here. Does anyone have any suggestions?

    from urlparse import urljoin
    import scrapy
    from scrapy_splash
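The excerpt is cut off before the request is built, so the cause can only be guessed. One common reason for getting an HTML body back is that the request goes to the default `render.html` endpoint, which renders the page and does not run `lua_source`. The sketch below builds the `splash` meta value for the `execute` endpoint instead. The helper name `splash_execute_meta` and the title script are assumptions for illustration.

```python
# Lua script that returns the document title (illustrative).
LUA_TITLE_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  return splash:evaljs("document.title")
end
"""


def splash_execute_meta(lua_source, **extra_args):
    """Build a request.meta['splash'] value that runs a Lua script.

    The 'execute' endpoint is the one that runs 'lua_source'; the
    render.html endpoint returns the rendered page HTML instead.
    """
    args = {"lua_source": lua_source}
    args.update(extra_args)
    return {"endpoint": "execute", "args": args}


meta = splash_execute_meta(LUA_TITLE_SCRIPT, timeout=30)
print(meta["endpoint"])
```

In a spider this dict would be assigned to `request.meta['splash']`, or the same values passed as `endpoint='execute'` and `args={...}` to `SplashRequest`.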

CrawlSpider with Splash getting stuck after first URL

主宰稳场 submitted on 2019-12-05 21:31:37
I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately, my crawl spider stops after rendering the first response. Any idea what is going wrong?

    class VideoSpider(CrawlSpider):
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash"),
        )

        def use_splash(self, request):
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {
                    'wait': 0.5,
                }
            }
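A possible cause, offered as a guess because the excerpt ends before the method does: Scrapy documents that a Rule's `process_request` callable must return a Request, or None to filter the request out. A callback that only mutates `request.meta` and returns nothing would therefore drop every extracted link. The sketch below shows the returning variant as a plain function. The `SimpleNamespace` object is a stand-in for `scrapy.Request` so the example runs without Scrapy installed.

```python
from types import SimpleNamespace


def use_splash(request, response=None):
    """Tag a request for Splash rendering and hand it back to the crawler.

    `response` is accepted but unused, so the same function fits both the
    older one-argument and the newer two-argument callback signatures.
    """
    request.meta["splash"] = {
        "endpoint": "render.html",
        "args": {"wait": 0.5},
    }
    return request  # returning None would filter the request out


# Stand-in for scrapy.Request (hypothetical, keeps the sketch self-contained).
fake_request = SimpleNamespace(url="https://example.com/", meta={})
tagged = use_splash(fake_request)
print(tagged.meta["splash"]["endpoint"])
```

Inside the spider the same body would live in a method taking `self` as its first parameter.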

Scrapy + Splash (Docker) Issue

你说的曾经没有我的故事 submitted on 2019-12-05 08:00:07
Question: I have Scrapy and scrapy-splash set up on an AWS Ubuntu server. It works fine for a while, but after a few hours I start getting error messages like this:

    Traceback (most recent call last):
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
        return g.throw(self

How to install python-gtk2, python-webkit and python-jswebkit on OSX

天大地大妈咪最大 submitted on 2019-12-04 06:39:48
I've read through many of the related questions but am still unclear on how to do this, as there are many software combinations available and many solutions seem outdated. What is the best way to install python-gtk2, python-webkit, and python-jswebkit in my virtual environment on OSX? Do I also have to install GTK+ and WebKit? If so, how? I would also appreciate a simple explanation of how these pieces of software work together. (I'm trying to use scrapyjs, which requires these libraries.)

You should try using pip (a tool for installing and managing Python packages). https://pypi.python.org

Does using scrapy-splash significantly affect scraping speed? [closed]

牧云@^-^@ submitted on 2019-12-04 04:41:28
So far, I have been using just Scrapy and writing custom classes to deal with websites that use AJAX. But if I were to use scrapy-splash, which, from what I understand, scrapes the rendered HTML after JavaScript has run, will the speed of my crawler be affected significantly? How does the time it takes to scrape a vanilla HTML page with Scrapy compare with scraping JavaScript-rendered HTML with scrapy-splash? And lastly, how do scrapy-splash and Selenium compare?

It depends on the amount of JavaScript present on the page. You must know that to render all the JavaScript, Splash takes some time and
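To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes pages are fetched in fixed-size batches and that rendering only adds per-page latency. The function name and every number are invented for illustration and are not measurements of Scrapy or Splash.

```python
def estimated_crawl_seconds(pages, seconds_per_page, concurrency):
    """Rough wall-clock estimate: pages are fetched `concurrency` at a time."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    batches = -(-pages // concurrency)  # ceiling division
    return batches * seconds_per_page


# All numbers below are made up for illustration, not measurements.
plain = estimated_crawl_seconds(pages=1000, seconds_per_page=0.5, concurrency=16)
rendered = estimated_crawl_seconds(pages=1000, seconds_per_page=3.0, concurrency=16)
print(plain, rendered, rendered / plain)
```

Under this simple model the slowdown equals the ratio of per-page latencies, so the real answer depends on how long each page's JavaScript takes to settle and on how many renders the Splash instance can handle in parallel.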

Scrapy + Splash (Docker) Issue

走远了吗. submitted on 2019-12-03 23:19:52
I have Scrapy and scrapy-splash set up on an AWS Ubuntu server. It works fine for a while, but after a few hours I start getting error messages like this:

    Traceback (most recent call last):
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/ubuntu/.local/lib/python3.5/site-packages/scrapy/core

“500 Internal Server Error” when combining Scrapy over Splash with an HTTP proxy

雨燕双飞 submitted on 2019-12-03 22:47:51
Question: I'm trying to run a Scrapy spider in a Docker container using both Splash (to render JavaScript) and Tor through Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using to this end:

    version: '3'
    services:
      scraper:
        build: ./apk_splash
        # environment:
        #   - http_proxy=http://tor-privoxy:8118
        links:
          - tor-privoxy
          - splash
      tor-privoxy:
        image: rdsubhas/tor-privoxy-alpine
      splash:
        image: scrapinghub/splash

where the scraper has the following Dockerfile:

    FROM python:alpine
    RUN apk -
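One direction worth trying, stated as an assumption because the rest of the question is cut off: setting `http_proxy` on the scraper container would likely send the scraper's own requests to Splash through Privoxy as well. Splash's render endpoints accept a `proxy` argument (a proxy profile name or a proxy URL), which lets Splash itself fetch pages through the proxy. The helper name below is hypothetical; the host and port come from the compose file above.

```python
def splash_args_with_proxy(proxy_host, proxy_port, wait=0.5):
    """Build Splash render arguments that route page traffic through an HTTP proxy."""
    return {
        "wait": wait,
        "proxy": "http://{}:{}".format(proxy_host, proxy_port),
    }


# Service name and port taken from the docker-compose.yml in the question.
args = splash_args_with_proxy("tor-privoxy", 8118)
print(args["proxy"])
```

These arguments would go into the `args` of a `SplashRequest` (or `request.meta['splash']['args']`), leaving the scraper-to-Splash connection unproxied.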

Scrapy + Splash: scraping element inside inner html

有些话、适合烂在心里 submitted on 2019-12-03 22:44:05
Question: I'm using Scrapy + Splash to crawl webpages and try to extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into them. I'm using the Scrapy-Splash API to render the pages so that their scripts and images load and to take screenshots, but it seems Google ad banners are created by JS scripts that then insert their contents into a new HTML document within an iframe in the webpage, as so: Splash makes sure the code is rendered so I don't run
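A sketch of one way to reach iframe content, assuming the `render.json` endpoint is called with `html=1` and `iframes=1`: as far as I understand the Splash documentation, the result then carries nested frames under a `childFrames` key, each with its own `html`. The helper name and the sample result below are hypothetical.

```python
def collect_frame_html(frame):
    """Return the HTML of a frame and all nested child frames, depth-first."""
    pages = []
    if "html" in frame:
        pages.append(frame["html"])
    for child in frame.get("childFrames", []):
        pages.extend(collect_frame_html(child))
    return pages


# Hypothetical render.json result, trimmed to the keys used here.
sample = {
    "url": "http://example.com/",
    "html": "<html>outer</html>",
    "childFrames": [
        {"html": "<html>ad</html>", "childFrames": []},
    ],
}
print(collect_frame_html(sample))
```

Each returned string could then be wrapped in its own selector, so XPath expressions run against the iframe's document and not against the outer page.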