scrapy-splash | 易学教程

Read cookies from Splash request

Question: I'm trying to access cookies after I've made a request using Splash. Below is how I've built the request.

    script = """
    function main(splash)
      splash:init_cookies(splash.args.cookies)
      assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
      })
      assert(splash:wait(0.5))

      local entries = splash:history()
      local last_response = entries[#entries].response
      return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last
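The script above returns the headers of the last response, so one way to get at the cookies is to parse the `Set-Cookie` entries on the Python side. The following is a minimal sketch using only the standard library. It assumes the headers arrive as a HAR-style list of `{"name": ..., "value": ...}` dicts; the function name `cookies_from_har_headers` and the `sample` payload are made up for illustration and are not part of the original question.

```python
from http.cookies import SimpleCookie


def cookies_from_har_headers(headers):
    """Collect cookie names and values from a HAR-style header list.

    Assumption: `headers` looks like
    [{"name": "Set-Cookie", "value": "sid=abc; Path=/"}, ...].
    """
    cookies = {}
    for header in headers:
        # Header names are case-insensitive, so normalise before comparing.
        if header.get("name", "").lower() != "set-cookie":
            continue
        jar = SimpleCookie()
        jar.load(header.get("value", ""))
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Hypothetical payload standing in for the table returned by the Lua script.
sample = {
    "url": "http://example.com/",
    "http_status": 200,
    "headers": [
        {"name": "Content-Type", "value": "text/html"},
        {"name": "Set-Cookie", "value": "sessionid=abc123; Path=/; HttpOnly"},
        {"name": "Set-Cookie", "value": "theme=dark; Path=/"},
    ],
}

parsed = cookies_from_har_headers(sample["headers"])
print(parsed)
```

This only sees cookies set by the last response. Cookies set earlier in a redirect chain or by JavaScript would not show up in these headers.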

Scrapy Splash won't execute lua script

Question: I have run into an issue in which my Lua script refuses to execute. The response returned from the ScrapyRequest call seems to be an HTML body, while I'm expecting a document title. I assume the Lua script is never called, as it has no apparent effect on the response. I have dug through the documentation a lot and can't figure out what is missing here. Does anyone have any suggestions?

    from urlparse import urljoin
    import scrapy
    from scrapy_splash
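The excerpt cuts off before the request itself is shown, so the cause is not confirmed. One thing worth checking is the endpoint: Splash runs `lua_source` only for script endpoints such as `execute`, while `render.html` simply returns the rendered HTML. The sketch below builds (but does not send) a request to the `execute` endpoint with the standard library. The Splash address `http://localhost:8050`, the target URL, and the helper name `build_execute_request` are assumptions for illustration.

```python
import json
from urllib.request import Request

# A small Lua script that returns the document title.
LUA_SOURCE = """
function main(splash)
  assert(splash:go(splash.args.url))
  return splash:evaljs("document.title")
end
"""


def build_execute_request(splash_url, page_url, lua_source):
    """Build a POST request for Splash's `execute` endpoint (not sent here)."""
    payload = {"url": page_url, "lua_source": lua_source}
    return Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local Splash instance; nothing is sent over the network.
req = build_execute_request("http://localhost:8050", "http://example.com", LUA_SOURCE)
print(req.get_method(), req.full_url)
```

With scrapy-splash the equivalent choice is the `endpoint` argument of `SplashRequest`, which defaults to `render.html`.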

CrawlSpider with Splash getting stuck after first URL

I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_urls responses to feed my crawl spider. Unfortunately, my crawl spider stops after rendering the first response. Any idea what is going wrong?

    class VideoSpider(CrawlSpider):
        start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

        rules = (
            Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash"),
        )

        def use_splash(self, request):
            request.meta['splash'] = {
                'endpoint': 'render.html',
                'args': {
                    'wait': 0.5,
                }
            }
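The excerpt of `use_splash` ends without a `return`, and it may simply be truncated. Scrapy's `Rule` documentation states that a `process_request` callable must return a request object, or `None` to filter the request out, so a hook that falls off the end drops every extracted link. The sketch below illustrates that contract with a stand-in object instead of a real Scrapy `Request`, so it runs without Scrapy installed. `fake_request` is hypothetical, and this may not be the only reason the spider stops.

```python
from types import SimpleNamespace


def use_splash(request):
    """Attach Splash rendering options and hand the request back."""
    request.meta["splash"] = {
        "endpoint": "render.html",
        "args": {"wait": 0.5},
    }
    # Returning the request keeps it in the crawl; returning None would drop it.
    return request


# Stand-in for a Scrapy Request: any object with a `meta` dict works here.
fake_request = SimpleNamespace(url="https://example.com/", meta={})
result = use_splash(fake_request)
print(result.meta["splash"]["endpoint"])
```

In the spider itself the function would be a method taking `self` as its first argument, as in the question.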

How to install python-gtk2, python-webkit and python-jswebkit on OSX

I've read through many of the related questions but am still unclear how to do this, as there are many software combinations available and many solutions seem outdated. What is the best way to install the following in my virtual environment on OSX?

python-gtk2
python-webkit
python-jswebkit

Do I also have to install GTK+ and WebKit? If so, how? I would also appreciate a simple explanation of how these pieces of software work together. (I'm trying to use scrapyjs, which requires these libraries.)

Answer: You should try using pip (a tool for installing and managing Python packages). https://pypi.python.org

Does using scrapy-splash significantly affect scraping speed? [closed]

So far, I have been using just Scrapy and writing custom classes to deal with websites that use Ajax. If I were to use scrapy-splash, which as I understand it scrapes the rendered HTML after JavaScript has run, would the speed of my crawler be affected significantly? How does the time to scrape a vanilla HTML page with Scrapy compare with the time to scrape JavaScript-rendered HTML with scrapy-splash? And lastly, how do scrapy-splash and Selenium compare?

Answer: It depends on the amount of JavaScript present on the page. Keep in mind that Splash takes some time to render all the JavaScript and
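The effect of per-page render time on total crawl time can be sketched with a back-of-envelope estimate. The model below assumes pages are processed in fixed-size concurrent batches and that every page costs the same time, which is a simplification. All timings (0.2 s for a plain fetch, 1.5 s for a rendered page) and the concurrency of 16 are made-up placeholders, not measurements of Scrapy or Splash.

```python
def estimated_crawl_seconds(pages, seconds_per_page, concurrency):
    """Rough wall-clock estimate: pages are handled `concurrency` at a time."""
    batches = -(-pages // concurrency)  # ceiling division
    return batches * seconds_per_page


# Placeholder numbers chosen only to show the shape of the comparison.
PAGES = 1000
CONCURRENCY = 16
plain_seconds = estimated_crawl_seconds(PAGES, 0.2, CONCURRENCY)
rendered_seconds = estimated_crawl_seconds(PAGES, 1.5, CONCURRENCY)

print(f"plain: {plain_seconds:.1f}s, rendered: {rendered_seconds:.1f}s")
```

Under this model the slowdown is simply the ratio of per-page times, so measuring a handful of representative pages with and without rendering gives a more useful answer than any general figure.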

Scrapy + Splash (Docker) Issue

I have Scrapy and scrapy-splash set up on an AWS Ubuntu server. It works fine for a while, but after a few hours I start getting error messages like this:

    Traceback (most recent call last):
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/home/ubuntu/.local/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/ubuntu/.local/lib/python3.5/site-packages/scrapy/core

“500 Internal Server Error” when combining Scrapy over Splash with an HTTP proxy

Question: I'm trying to run a Scrapy spider in a Docker container using both Splash (to render JavaScript) and Tor through Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using to this end:

    version: '3'

    services:
      scraper:
        build: ./apk_splash
        # environment:
        #   - http_proxy=http://tor-privoxy:8118
        links:
          - tor-privoxy
          - splash

      tor-privoxy:
        image: rdsubhas/tor-privoxy-alpine

      splash:
        image: scrapinghub/splash

where the scraper has the following Dockerfile:

    FROM python:alpine
    RUN apk -
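When Splash renders a page, Splash itself fetches it, and its HTTP API accepts a `proxy` argument for that fetch. The sketch below builds a `render.html` URL that asks Splash to go through Privoxy. The service names and port 8118 come from the compose file above, port 8050 is Splash's default, and the helper name `splash_render_url` is hypothetical. Whether this resolves the 500 error is not confirmed by the excerpt.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def splash_render_url(splash_base, page_url, proxy=None, wait=0.5):
    """Build a render.html URL, optionally routing Splash through a proxy."""
    params = {"url": page_url, "wait": wait}
    if proxy:
        # Splash's `proxy` argument: a proxy URL (or a proxy profile name).
        params["proxy"] = proxy
    return splash_base.rstrip("/") + "/render.html?" + urlencode(params)


# Assumed addresses based on the docker-compose service names.
render_url = splash_render_url(
    "http://splash:8050",
    "http://example.com/",
    proxy="http://tor-privoxy:8118",
)
print(render_url)
```

With scrapy-splash the same argument would go into the `args` dict of the request, next to `wait`.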

Scrapy + Splash: scraping element inside inner html

Question: I'm using Scrapy + Splash to crawl webpages and extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into them. I'm using the Scrapy-Splash API to render the pages, so that their scripts and images load, and to take screenshots. It seems, however, that Google ad banners are created by JS scripts that then insert their contents into a new HTML document within an iframe in the webpage, as so: Splash makes sure the code is rendered so I don't run
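Content inside an iframe lives in a separate document, so an XPath run against the parent page's HTML does not reach it. Splash's `render.json` endpoint can include child frames when called with `html=1` and `iframes=1`. The sketch below walks such a result recursively. It assumes each frame is a dict with `url`, `html`, and a `childFrames` list; the `sample` data and the function name `iter_frame_html` are made up for illustration.

```python
def iter_frame_html(frame):
    """Yield (url, html) for a frame and every nested child frame."""
    yield frame.get("url"), frame.get("html", "")
    for child in frame.get("childFrames", []):
        yield from iter_frame_html(child)


# Hypothetical render.json-style result with one nested ad iframe.
sample = {
    "url": "http://example.com/",
    "html": "<html><body><iframe src='/ad'></iframe></body></html>",
    "childFrames": [
        {
            "url": "http://example.com/ad",
            "html": "<html><body><a href='http://ads.example/x'>ad</a></body></html>",
            "childFrames": [],
        }
    ],
}

frames = list(iter_frame_html(sample))
for url, html in frames:
    print(url, len(html))
```

Each yielded HTML string can then be parsed on its own, for example with a Scrapy `Selector`, and queried with XPath as usual.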