scrapy

The difference between scrapy and scrapy-redis

Submitted by 血红的双手。 on 2021-02-08 23:03:55
Scrapy is a Python crawling framework with very high crawl efficiency and a high degree of customizability, but it does not support distributed crawling by itself. scrapy-redis is a set of components, based on the redis library, that runs on top of the Scrapy framework and gives Scrapy a distributed strategy: Slaver nodes share the item queue, request queue, and request-fingerprint set kept in the Master node's redis database.

Why redis was chosen: redis supports master-slave replication and caches all data in memory, so a redis-based distributed crawler handles the high-frequency reads of requests and data very efficiently. The relationship between scrapy-redis and Scrapy is like that between a computer and a solid-state drive: Scrapy is the crawling framework, and scrapy-redis is an optional plug-in for that framework that makes the crawler run faster.

How a request flows through the framework:
1. The scheduler takes a Request object from the priority queue and hands it to the engine.
2. The engine passes the Request to the downloader, going through each downloader middleware's process_request() method on the way.
3. The downloader completes the download, obtains a Response object, and hands it back to the engine, passing through the downloader middlewares' process_response() methods.
4. The engine hands the Response to the spider for parsing, passing through the spider middlewares' process_spider_input()
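The sharing described above is wired up through project settings. A minimal sketch of what a scrapy-redis project typically sets, assuming the class paths documented by scrapy-redis (the redis URL is a placeholder):

```python
# settings.py (sketch) -- swap Scrapy's scheduler and dupefilter for the
# redis-backed versions from scrapy-redis, so every Slaver node shares one
# request queue and one request-fingerprint set on the Master's redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue between runs
REDIS_URL = "redis://master-host:6379"   # placeholder Master address
```

With these settings, any number of identical spider processes on different machines pull from the same redis-backed queue instead of their own in-memory one.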

How to call particular Scrapy spiders from another Python script

Submitted by 我与影子孤独终老i on 2021-02-08 13:53:13
Question: I have a script called algorithm.py and I want to be able to call Scrapy spiders from inside it. The file structure is: algorithm.py MySpiders/ where MySpiders is a folder containing several Scrapy projects. I would like to create methods perform_spider1(), perform_spider2(), ... which I can call in algorithm.py. How do I construct such a method? I have managed to call one spider using the following code; however, it's not a method and it only works for one spider. I'm a beginner in need of
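One common way to wrap each spider in a callable is Scrapy's CrawlerProcess API. A hedged sketch, assuming the method names from the question; the import path and MySpider1 class are placeholders for whatever the asker's projects define:

```python
def run_spiders(*spider_classes):
    """Run the given spider classes in one process and block until they
    all finish.  scrapy is imported lazily so this helper can be defined
    before the projects are on sys.path."""
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    for cls in spider_classes:
        process.crawl(cls)
    process.start()  # blocks; Twisted's reactor can only be started once

def perform_spider1():
    # placeholder import path -- adjust to the real project layout
    from MySpiders.project1.spiders.spider1 import MySpider1
    run_spiders(MySpider1)
```

One caveat: CrawlerProcess.start() can only run once per Python process (the Twisted reactor is not restartable), so to run several spiders, pass them all to a single run_spiders() call rather than calling perform_spider1() and perform_spider2() one after the other.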

Python rapidly creating and removing directories will cause WindowsError [Error 5] intermittently

Submitted by 自古美人都是妖i on 2021-02-08 12:45:06
Question: I encountered this problem while using Scrapy's FifoDiskQueue. On Windows, FifoDiskQueue causes directories and files to be created by one file descriptor and consumed (and, if there are no more messages in the queue, removed) by another file descriptor. I get error messages like the following, at random: 2015-08-25 18:51:30 [scrapy] INFO: Error while handling downloader output Traceback (most recent call last): File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 588, in
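Intermittent access-denied errors on rapid create/remove cycles on Windows are commonly worked around by retrying the filesystem call. A generic sketch, not FifoDiskQueue-specific; the attempt count and delay are arbitrary choices:

```python
import time

def retry_fs(op, *args, attempts=5, delay=0.1):
    """Retry a filesystem operation that can fail transiently on Windows
    (e.g. WindowsError [Error 5] while another handle still holds the
    directory).  Re-raises the last error if all attempts fail."""
    for i in range(attempts):
        try:
            return op(*args)
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# usage sketch:  retry_fs(shutil.rmtree, path)  or  retry_fs(os.makedirs, path)
```

This does not remove the underlying race between the two descriptors, but it usually papers over the brief window in which Windows still holds the old handle.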

How to store Response in Scrapy? [closed]

Submitted by 血红的双手。 on 2021-02-08 12:11:57
Question: I want to store the response of a request in Scrapy. I have the following code for the time being: yield Request(requestURL, callback=self.afterResponse) Now what I want is not to call the function afterResponse upon arrival of the response, but to store the response somehow so that I can
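Since Scrapy is asynchronous, the response cannot be returned "here" at the yield; one workable pattern is to make the callback do nothing but stash the response on the spider for later use. A sketch under that assumption; the responses dict and after_response name are illustrative, not Scrapy API:

```python
class ResponseStore:
    """Sketch of a spider mixin: the Request callback only records the
    response, keyed by URL, instead of processing it immediately."""

    def __init__(self):
        self.responses = {}

    def after_response(self, response):
        # nothing is yielded here; items can be built later from self.responses
        self.responses[response.url] = response
```

In a real spider this would be wired as yield Request(requestURL, callback=self.after_response), and the stored responses read once the crawl step has completed.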

Fetch pages with scrapy behind Google Authentication

Submitted by 余生长醉 on 2021-02-08 10:44:21
Question: I'm trying to log into a website that uses Google credentials. This fails in my Scrapy spider: def parse(self, response): return scrapy.FormRequest.from_response( response, formdata={'email': self.var.user, 'password': self.var.password}, callback=self.after_login) Any tips? Answer 1: After further inspection I managed to solve this; it turned out to be a simple issue: the fields are Email and Passwd, in that order. Break the login into two requests, the first for the email, the second for the password. The code
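The answer's key point is that Google's login form takes Email first and Passwd second, in two separate form submissions. A minimal sketch of the two payloads, assuming those field names from the answer; the values are placeholders:

```python
def email_step_formdata(user):
    # first request: only the Email field is submitted
    return {"Email": user}

def password_step_formdata(password):
    # second request: the follow-up form carries the Passwd field
    return {"Passwd": password}

# In the spider, each dict would be passed as formdata= to
# scrapy.FormRequest.from_response(response, ...), with the second
# request built from the response to the first.
```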

Can't install Scrapyd on EC2

Submitted by 霸气de小男生 on 2021-02-08 10:28:19
Question: I'm trying to install the scrapyd service on an EC2 instance to deploy a Scrapy project. What I have done: 1- Imported the GPG key used to sign Scrapy packages into the APT keyring: sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7 2- Created the /etc/apt/sources.list.d/scrapy.list file using the following command: sudo su -c "echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list" 3- Installed scrapyd: sudo apt-get update && sudo

can't create project in scrapy says dll load failed

Submitted by 前提是你 on 2021-02-08 10:17:38
Question: from cryptography.hazmat.bindings._openssl import ffi, lib ImportError: DLL load failed: The operating system cannot run %1. I installed Scrapy through conda with conda install scrapy -c conda-forge Answer 1: I also ran into this problem on Windows 10. After much searching across many websites, I found this solution: download https://github.com/python/cpython-bin-deps/tree/openssl-bin-1.0.2k, unzip the file, copy the folder (amd64 or win32) into your system path C:\Windows\SysWOW64, and voilà, every

Scrapy Extract ld+JSON

Submitted by 吃可爱长大的小学妹 on 2021-02-08 09:48:20
Question: How do I extract the name and url? quotes_spiders.py import scrapy import json class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"] def parse(self, response): data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first()) # how to extract the name and url? yield data Data to extract: <script type="application/ld+json">{"@context":"https://schema.org","@type":"ItemList","itemListElement"
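Once the ld+json text is loaded with json.loads, the name and url live under the itemListElement entries. A sketch on a trimmed sample record (the sample below is an abbreviated, hypothetical stand-in for the Lazada blob, which is truncated in the question):

```python
import json

# abbreviated sample of a schema.org ItemList blob
LD_JSON = '''{"@context": "https://schema.org", "@type": "ItemList",
 "itemListElement": [
   {"@type": "ListItem", "position": 1,
    "item": {"name": "Sample Power Bank",
             "url": "http://www.lazada.com.my/sample.html"}}
 ]}'''

def extract_items(ld_text):
    """Yield (name, url) pairs from a schema.org ItemList blob."""
    data = json.loads(ld_text)
    for element in data.get("itemListElement", []):
        item = element.get("item", element)  # some feeds inline the fields
        yield item.get("name"), item.get("url")

pairs = list(extract_items(LD_JSON))
```

In the spider, data["itemListElement"] would be iterated the same way after the json.loads call the asker already has.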

Stop scrapy from redirecting to country specific domain

Submitted by 大憨熊 on 2021-02-08 09:00:46
Question: I am trying to extract data from airbnb.com, but whenever I try to access the website under its .com domain, I am redirected to the .ca domain. Here is a snippet that illustrates the issue: In [46]: fetch(url) 2021-02-05 09:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (307) to <GET https://www.airbnb.ca/s/nova/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&source=structured_search_input_header&search_type=search
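One hedged approach is to tell Scrapy's RedirectMiddleware not to follow the redirect for that request, so the callback sees the raw 307 instead of the .ca page. The meta keys below are Scrapy's documented ones; bundling them in a constant is just a sketch:

```python
# Request meta that makes Scrapy's RedirectMiddleware leave the response
# alone and lets the callback handle the raw redirect status itself.
NO_REDIRECT_META = {
    "dont_redirect": True,
    "handle_httpstatus_list": [301, 302, 307, 308],
}

# In the spider:
#   yield scrapy.Request(url, meta=NO_REDIRECT_META, callback=self.parse)
```

Note that Airbnb's redirect is geo-based, so stopping the redirect only exposes the 307; actually getting .com content may additionally require sending the locale cookies or headers a .com browser session would carry.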

How to use http and https proxy together in scrapy?

Submitted by 淺唱寂寞╮ on 2021-02-08 07:56:37
Question: I am new to Scrapy. I found out how to use an HTTP proxy, but I want to use HTTP and HTTPS proxies together, because the links I crawl include both http and https URLs. How do I use an HTTP and an HTTPS proxy at the same time? class ProxyMiddleware(object): def process_request(self, request, spider): request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT" # like here request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT" proxy_user_pass = "USERNAME:PASSWORD" # setup basic authentication for the proxy encoded_user_pass =
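One way to serve both kinds of link from a single middleware is to pick the proxy endpoint from the request URL's scheme instead of hard-coding one. A sketch under the assumption that request exposes .url, .meta, and .headers as Scrapy's Request does; the proxy addresses and credentials are the question's placeholders:

```python
import base64
from urllib.parse import urlsplit

HTTP_PROXY = "http://YOUR_PROXY_IP:PORT"    # placeholder
HTTPS_PROXY = "https://YOUR_PROXY_IP:PORT"  # placeholder
PROXY_USER_PASS = "USERNAME:PASSWORD"       # placeholder

class SchemeProxyMiddleware:
    """Choose the proxy endpoint from the URL scheme instead of always
    assigning one, so http and https links each get a matching proxy."""

    def process_request(self, request, spider):
        scheme = urlsplit(request.url).scheme
        request.meta["proxy"] = HTTPS_PROXY if scheme == "https" else HTTP_PROXY
        # basic authentication for the proxy, as in the question
        token = base64.b64encode(PROXY_USER_PASS.encode()).decode()
        request.headers["Proxy-Authorization"] = "Basic " + token
```

Whether separate endpoints are needed at all depends on the provider: many proxies tunnel https via CONNECT through the same http endpoint, in which case a single proxy value works for both schemes.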