scrapy-splash

Scrapy + Splash: scraping element inside inner html

情到浓时终转凉″ submitted on 2019-12-01 01:55:42
I'm using Scrapy + Splash to crawl webpages and I'm trying to extract data from Google ad banners and other ads, but I'm having difficulty getting Scrapy to follow the XPath into them. I'm using the Scrapy-Splash API to render the pages so their scripts and images load, and to take screenshots, but it seems Google ad banners are created by JS scripts that then insert their contents into a new HTML document within an iframe in the webpage, like so: Splash makes sure the code is rendered, so I don't run into the usual problem Scrapy has with scripts, where it reads the script's content instead of its
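The post is truncated above, but one approach worth noting (a sketch, not taken from the original): Splash's render.json endpoint accepts an iframes=1 argument that returns the rendered HTML of each child frame alongside the main document, so XPath can be run against the iframe content directly. The spider name, URL, and XPath below are placeholders:

```python
import scrapy
from scrapy_splash import SplashRequest

class AdFrameSpider(scrapy.Spider):
    # hypothetical spider; the URL and XPath are placeholders
    name = "ad_frames"

    def start_requests(self):
        # render.json with iframes=1 asks Splash to include the rendered
        # HTML of every child frame in its JSON response
        yield SplashRequest(
            "https://example.com/page-with-ads",
            self.parse,
            endpoint="render.json",
            args={"html": 1, "iframes": 1, "wait": 2},
        )

    def parse(self, response):
        # scrapy-splash exposes the decoded JSON as response.data
        for frame in response.data.get("childFrames", []):
            frame_sel = scrapy.Selector(text=frame["html"])
            # XPath now runs against the iframe's own document
            for src in frame_sel.xpath("//img/@src").getall():
                yield {"ad_image": src}
```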

“500 Internal Server Error” when combining Scrapy over Splash with an HTTP proxy

白昼怎懂夜的黑 submitted on 2019-12-01 01:06:56
I'm trying to run a Scrapy spider in a Docker container, using both Splash (to render JavaScript) and Tor through Privoxy (to provide anonymity). Here is the docker-compose.yml I'm using to this end:

    version: '3'
    services:
      scraper:
        build: ./apk_splash
        # environment:
        #   - http_proxy=http://tor-privoxy:8118
        links:
          - tor-privoxy
          - splash
      tor-privoxy:
        image: rdsubhas/tor-privoxy-alpine
      splash:
        image: scrapinghub/splash

where the scraper has the following Dockerfile:

    FROM python:alpine
    RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl bash
    RUN pip install
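The post stops here, but a common resolution for this class of 500 error (an assumption on my part, not stated above) is that the proxy must be applied by Splash itself rather than by the scraper container, since Splash is the process that actually fetches the page. Splash 2.1+ accepts a proxy URL directly in its proxy argument; a minimal sketch reusing the tor-privoxy service name from the docker-compose.yml above:

```python
import scrapy
from scrapy_splash import SplashRequest

class AnonSpider(scrapy.Spider):
    # hypothetical spider: route Splash's outgoing traffic through Privoxy
    name = "anon"

    def start_requests(self):
        # Splash, not the Scrapy container, makes the outgoing page request,
        # so the proxy is passed in the Splash args rather than set as an
        # http_proxy environment variable on the scraper.
        yield SplashRequest(
            "https://check.torproject.org",  # handy page to verify Tor is used
            self.parse,
            args={"wait": 2, "proxy": "http://tor-privoxy:8118"},
        )

    def parse(self, response):
        yield {"via_tor": "Congratulations" in response.text}
```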

Using docker, scrapy splash on Heroku

六眼飞鱼酱① submitted on 2019-11-30 20:59:54
I have a Scrapy spider that uses Splash, which runs in Docker on localhost:8050 to render JavaScript before scraping. I am trying to run this on Heroku but have no idea how to configure Heroku to start Docker and run Splash before running my 'web: scrapy crawl abc' dyno. Any guidance is greatly appreciated! From what I gather, you're expecting:

- a Splash instance running on Heroku via a Docker container
- your web application (the Scrapy spider) running in a Heroku dyno

Splash instance: ensure you have the docker CLI and heroku CLI installed. As seen in Heroku's Container Registry - Pushing existing image(s), the flow is sketched below:
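A sketch of that push-and-release flow, based on Heroku's container registry documentation; the app name your-splash-app is a placeholder, and note that Heroku assigns the listening port via $PORT while the stock scrapinghub/splash image listens on 8050, so the container's start command may need a --port override:

```sh
# log in to Heroku's container registry
heroku container:login

# pull the stock Splash image, tag it for your app, and push it
docker pull scrapinghub/splash
docker tag scrapinghub/splash registry.heroku.com/your-splash-app/web
docker push registry.heroku.com/your-splash-app/web

# release the pushed image on the app's web dyno
heroku container:release web --app your-splash-app
```

After the release, point SPLASH_URL in the spider's settings at the Heroku app's URL instead of localhost:8050.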

how does scrapy-splash handle infinite scrolling?

人走茶凉 submitted on 2019-11-30 10:15:19
I want to reverse engineer the content generated by scrolling down in the webpage. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933 . screwrand doesn't seem to follow any pattern, so reversing the URLs doesn't work. I'm considering automatic rendering using Splash. How to use Splash to scroll like a browser? Thanks a lot! Here is the code for two requests:

    request1 = scrapy_splash.SplashRequest('https://www.crowdfunder.com/user/following/{}'.format(user_id), self.parse_follow_relationship, args={'wait':2},
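The post is truncated, but one common approach (a sketch, not from the original) is to run a small Lua script through Splash's execute endpoint that scrolls to the bottom of the page a fixed number of times, waiting between scrolls so lazily loaded content can arrive; the spider name and scroll count here are illustrative:

```python
import scrapy
from scrapy_splash import SplashRequest

# Lua script: scroll to the bottom args.scroll_count times, pausing
# after each scroll so newly loaded items can render, then return
# the final HTML.
SCROLL_LUA = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    for _ = 1, args.scroll_count do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
        splash:wait(1)
    end
    return splash:html()
end
"""

class FollowingSpider(scrapy.Spider):
    name = "following"  # hypothetical spider name

    def start_requests(self):
        yield SplashRequest(
            "https://www.crowdfunder.com/user/following/80159",
            self.parse,
            endpoint="execute",  # run the custom Lua script above
            args={"lua_source": SCROLL_LUA, "scroll_count": 5},
        )

    def parse(self, response):
        # the response body is the HTML returned by splash:html()
        self.logger.info("rendered %d bytes", len(response.body))
```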

Splash lua script to do multiple clicks and visits

浪尽此生 submitted on 2019-11-30 09:57:28
I'm trying to crawl Google Scholar search results and get the BibTeX record of each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which will click the "Cite" link and load the modal window before getting the href of the BibTeX version of the citation. But since there are multiple search results, and hence multiple "Cite" links, I need to click them all and load the individual BibTeX pages. Here's what I have:

    import scrapy
    from scrapy_splash import SplashRequest

    class CiteSpider(scrapy.Spider):
        name = "cite"
        allowed_domains = [
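A sketch of how the multi-click loop could look (not the original poster's script): drive the clicks from Lua via splash:runjs/splash:evaljs, clicking each "Cite" button in turn and harvesting the BibTeX link from the modal. The CSS selectors (.gs_or_cit, a.gs_citi, #gs_cit-x) are assumptions about Google Scholar's markup and may need adjusting:

```python
import scrapy
from scrapy_splash import SplashRequest

# Lua script: click every "Cite" button, wait for the citation modal,
# grab the href of the BibTeX link, then close the modal and move on.
CITE_LUA = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    local n = splash:evaljs("document.querySelectorAll('.gs_or_cit').length")
    local hrefs = {}
    for i = 0, n - 1 do
        splash:runjs(string.format(
            "document.querySelectorAll('.gs_or_cit')[%d].click()", i))
        splash:wait(1)  -- let the citation modal render
        hrefs[#hrefs + 1] = splash:evaljs(
            "(document.querySelector('a.gs_citi') || {}).href")
        splash:runjs("document.getElementById('gs_cit-x').click()")  -- close the modal
        splash:wait(0.5)
    end
    return {hrefs = hrefs}
end
"""

class CiteLinksSpider(scrapy.Spider):
    name = "cite_links"

    def start_requests(self):
        yield SplashRequest(
            "https://scholar.google.com/scholar?q=splash",  # placeholder query
            self.parse,
            endpoint="execute",
            args={"lua_source": CITE_LUA},
        )

    def parse(self, response):
        # response.data holds the table returned by the Lua script
        for href in response.data.get("hrefs", []):
            yield {"bibtex_url": href}
```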

Adding a wait-for-element while performing a SplashRequest in python Scrapy

﹥>﹥吖頭↗ submitted on 2019-11-30 07:01:23
I am trying to scrape a few dynamic websites using Splash for Scrapy in Python. However, I see that Splash fails to wait for the complete page to load in certain cases. A brute-force way to tackle this problem was to add a large wait time (e.g. 5 seconds in the snippet below). However, this is extremely inefficient and still fails to load certain data (sometimes it takes longer than 5 seconds for the content to load). Is there some sort of wait-for-element condition that can be put through these requests?

    yield SplashRequest( url, self.parse, args={'wait': 5}, 'User-Agent':"Mozilla/5.0 (X11; Linux
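The render endpoints have no built-in wait-for-element argument, but a short Lua script on the execute endpoint can poll for a CSS selector instead of sleeping a fixed time (splash:select is available in Splash 2.3+). A sketch with a placeholder URL and selector:

```python
import scrapy
from scrapy_splash import SplashRequest

# Lua script: poll for a CSS selector every 0.5s, up to args.max_wait
# seconds, then return the page HTML whether or not it appeared.
WAIT_FOR_LUA = """
function main(splash, args)
    splash:go(args.url)
    local waited = 0
    while waited < args.max_wait do
        if splash:select(args.css) then
            break
        end
        splash:wait(0.5)
        waited = waited + 0.5
    end
    return splash:html()
end
"""

class WaitSpider(scrapy.Spider):
    name = "wait_for_element"

    def start_requests(self):
        yield SplashRequest(
            "https://example.com/dynamic",
            self.parse,
            endpoint="execute",
            args={"lua_source": WAIT_FOR_LUA,
                  "css": "div.results",  # element that signals the page is ready
                  "max_wait": 10},
        )

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```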

Scrapy Shell and Scrapy Splash

烂漫一生 submitted on 2019-11-28 03:43:31
We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a Request with specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',  # overrides SPLASH_URL
            'slot_policy':
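For trying Splash-rendered pages in the Scrapy shell (which is what the question title asks about), a common workaround, assuming Splash listens on localhost:8050, is to fetch Splash's render.html endpoint directly rather than constructing a SplashRequest:

```python
# From the command line -- the shell then receives the rendered page:
#   scrapy shell 'http://localhost:8050/render.html?url=http://example.com&wait=0.5'
#
# Inside an already-open shell, the same idea via the fetch() built-in:
from urllib.parse import quote

target = "http://example.com"
rendered = "http://localhost:8050/render.html?url={}&wait=0.5".format(quote(target))
print(rendered)
# fetch(rendered)  # 'fetch' exists only inside a Scrapy shell session
```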
