scrapy

The difference between scrapy and scrapy-redis

Submitted by 血红的双手。 on 2021-02-08 23:03:55
Scrapy is a Python crawling framework with very high crawl efficiency and a high degree of customizability, but it does not support distributed crawling by itself. scrapy-redis is a set of components, based on the redis library, that runs on top of the Scrapy framework and gives Scrapy a distributed strategy: Slaver nodes share the item queue, request queue, and request-fingerprint set kept in the Master node's redis database.

Why redis was chosen: redis supports master-slave replication and caches all data in memory, so a redis-based distributed crawler handles the high-frequency reads of requests and data very efficiently. The relationship between scrapy-redis and Scrapy is like that between a computer and a solid-state drive: Scrapy is the crawling framework, and scrapy-redis is an optional plug-in for that framework that makes the crawler run faster.

How a request flows through the framework:
1. The scheduler takes a Request object from the priority queue and hands it to the engine.
2. The engine passes the Request to the downloader, going through each downloader middleware's process_request() method on the way.
3. The downloader completes the download, obtains a Response object, and hands it back to the engine, passing through the downloader middlewares' process_response() methods.
4. The engine hands the Response to the spider for parsing, passing through the spider middlewares' process_spider_input()
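The sharing described above is wired up through project settings. A minimal sketch of what a scrapy-redis project typically sets, assuming the class paths documented by scrapy-redis (the redis URL is a placeholder):

```python
# settings.py (sketch) -- swap Scrapy's scheduler and dupefilter for the
# redis-backed versions from scrapy-redis, so every Slaver node shares one
# request queue and one request-fingerprint set on the Master's redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue between runs
REDIS_URL = "redis://master-host:6379"   # placeholder Master address
```

With these settings, any number of identical spider processes on different machines pull from the same redis-backed queue instead of their own in-memory one.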

How to call particular Scrapy spiders from another Python script

Submitted by 我与影子孤独终老i on 2021-02-08 13:53:13
Question: I have a script called algorithm.py and I want to be able to call Scrapy spiders from inside it. The file structure is: algorithm.py MySpiders/ where MySpiders is a folder containing several Scrapy projects. I would like to create methods perform_spider1(), perform_spider2(), ... which I can call in algorithm.py. How do I construct such a method? I have managed to call one spider using the following code; however, it's not a method and it only works for one spider. I'm a beginner in need of
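One common way to wrap each spider in a callable is Scrapy's CrawlerProcess API. A hedged sketch, assuming the method names from the question; the import path and MySpider1 class are placeholders for whatever the asker's projects define:

```python
def run_spiders(*spider_classes):
    """Run the given spider classes in one process and block until they
    all finish.  scrapy is imported lazily so this helper can be defined
    before the projects are on sys.path."""
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    for cls in spider_classes:
        process.crawl(cls)
    process.start()  # blocks; Twisted's reactor can only be started once

def perform_spider1():
    # placeholder import path -- adjust to the real project layout
    from MySpiders.project1.spiders.spider1 import MySpider1
    run_spiders(MySpider1)
```

One caveat: CrawlerProcess.start() can only run once per Python process (the Twisted reactor is not restartable), so to run several spiders, pass them all to a single run_spiders() call rather than calling perform_spider1() and perform_spider2() one after the other.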

Python rapidly creating and removing directories will cause WindowsError [Error 5] intermittently

Submitted by 自古美人都是妖i on 2021-02-08 12:45:06
Question: I encountered this problem while using Scrapy's FifoDiskQueue. On Windows, FifoDiskQueue causes directories and files to be created by one file descriptor and consumed (and, if there are no more messages in the queue, removed) by another file descriptor. I get error messages like the following, at random: 2015-08-25 18:51:30 [scrapy] INFO: Error while handling downloader output Traceback (most recent call last): File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 588, in
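Intermittent access-denied errors on rapid create/remove cycles on Windows are commonly worked around by retrying the filesystem call. A generic sketch, not FifoDiskQueue-specific; the attempt count and delay are arbitrary choices:

```python
import time

def retry_fs(op, *args, attempts=5, delay=0.1):
    """Retry a filesystem operation that can fail transiently on Windows
    (e.g. WindowsError [Error 5] while another handle still holds the
    directory).  Re-raises the last error if all attempts fail."""
    for i in range(attempts):
        try:
            return op(*args)
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# usage sketch:  retry_fs(shutil.rmtree, path)  or  retry_fs(os.makedirs, path)
```

This does not remove the underlying race between the two descriptors, but it usually papers over the brief window in which Windows still holds the old handle.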

How to store Response in Scrapy? [closed]

Submitted by 血红的双手。 on 2021-02-08 12:11:57
Question: I want to store the response of a request in Scrapy. I have the following code for the time being: yield Request(requestURL, callback=self.afterResponse) Now what I want is not to call the function afterResponse upon arrival of the response, but to store the response somehow so that I can
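Since Scrapy is asynchronous, the response cannot be returned "here" at the yield; one workable pattern is to make the callback do nothing but stash the response on the spider for later use. A sketch under that assumption; the responses dict and after_response name are illustrative, not Scrapy API:

```python
class ResponseStore:
    """Sketch of a spider mixin: the Request callback only records the
    response, keyed by URL, instead of processing it immediately."""

    def __init__(self):
        self.responses = {}

    def after_response(self, response):
        # nothing is yielded here; items can be built later from self.responses
        self.responses[response.url] = response
```

In a real spider this would be wired as yield Request(requestURL, callback=self.after_response), and the stored responses read once the crawl step has completed.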

Fetch pages with scrapy behind Google Authentication

Submitted by 余生长醉 on 2021-02-08 10:44:21
Question: I'm trying to log into a website that uses Google credentials. This fails in my Scrapy spider: def parse(self, response): return scrapy.FormRequest.from_response( response, formdata={'email': self.var.user, 'password': self.var.password}, callback=self.after_login) Any tips? Answer 1: After further inspection I managed to solve this; it turned out to be a simple issue: the fields are Email and Passwd, in that order. Break the login into two requests, the first for the email, the second for the password. The code
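The answer's key point is that Google's login form takes Email first and Passwd second, in two separate form submissions. A minimal sketch of the two payloads, assuming those field names from the answer; the values are placeholders:

```python
def email_step_formdata(user):
    # first request: only the Email field is submitted
    return {"Email": user}

def password_step_formdata(password):
    # second request: the follow-up form carries the Passwd field
    return {"Passwd": password}

# In the spider, each dict would be passed as formdata= to
# scrapy.FormRequest.from_response(response, ...), with the second
# request built from the response to the first.
```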

Can't install Scrapyd on EC2

Submitted by 霸气de小男生 on 2021-02-08 10:28:19
Question: I'm trying to install the scrapyd service on an EC2 instance to deploy a Scrapy project. What I have done: 1- Imported the GPG key used to sign Scrapy packages into the APT keyring: sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7 2- Created the /etc/apt/sources.list.d/scrapy.list file using the following command: sudo su -c "echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list" 3- Installed scrapyd: sudo apt-get update && sudo

can't create project in scrapy says dll load failed

Submitted by 前提是你 on 2021-02-08 10:17:38
Question: from cryptography.hazmat.bindings._openssl import ffi, lib ImportError: DLL load failed: The operating system cannot run %1. I installed Scrapy through conda with conda install scrapy -c conda-forge Answer 1: I also ran into this problem on Windows 10. After much searching across many websites, I found this solution: download https://github.com/python/cpython-bin-deps/tree/openssl-bin-1.0.2k, unzip the file, copy the folder (amd64 or win32) into your system path C:\Windows\SysWOW64, and voilà, every

Scrapy Extract ld+JSON

Submitted by 吃可爱长大的小学妹 on 2021-02-08 09:48:20
Question: How do I extract the name and url? quotes_spiders.py import scrapy import json class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"] def parse(self, response): data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first()) # how to extract the name and url? yield data Data to extract: <script type="application/ld+json">{"@context":"https://schema.org","@type":"ItemList","itemListElement"
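Once the ld+json text is loaded with json.loads, the name and url live under the itemListElement entries. A sketch on a trimmed sample record (the sample below is an abbreviated, hypothetical stand-in for the Lazada blob, which is truncated in the question):

```python
import json

# abbreviated sample of a schema.org ItemList blob
LD_JSON = '''{"@context": "https://schema.org", "@type": "ItemList",
 "itemListElement": [
   {"@type": "ListItem", "position": 1,
    "item": {"name": "Sample Power Bank",
             "url": "http://www.lazada.com.my/sample.html"}}
 ]}'''

def extract_items(ld_text):
    """Yield (name, url) pairs from a schema.org ItemList blob."""
    data = json.loads(ld_text)
    for element in data.get("itemListElement", []):
        item = element.get("item", element)  # some feeds inline the fields
        yield item.get("name"), item.get("url")

pairs = list(extract_items(LD_JSON))
```

In the spider, data["itemListElement"] would be iterated the same way after the json.loads call the asker already has.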

Stop scrapy from redirecting to country specific domain

Submitted by 大憨熊 on 2021-02-08 09:00:46
Question: I am trying to extract data from airbnb.com, but whenever I try to access the website under its .com domain, I am redirected to the .ca domain. Here is a snippet that illustrates the issue: In [46]: fetch(url) 2021-02-05 09:17:36 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (307) to <GET https://www.airbnb.ca/s/nova/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&source=structured_search_input_header&search_type=search
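One hedged approach is to tell Scrapy's RedirectMiddleware not to follow the redirect for that request, so the callback sees the raw 307 instead of the .ca page. The meta keys below are Scrapy's documented ones; bundling them in a constant is just a sketch:

```python
# Request meta that makes Scrapy's RedirectMiddleware leave the response
# alone and lets the callback handle the raw redirect status itself.
NO_REDIRECT_META = {
    "dont_redirect": True,
    "handle_httpstatus_list": [301, 302, 307, 308],
}

# In the spider:
#   yield scrapy.Request(url, meta=NO_REDIRECT_META, callback=self.parse)
```

Note that Airbnb's redirect is geo-based, so stopping the redirect only exposes the 307; actually getting .com content may additionally require sending the locale cookies or headers a .com browser session would carry.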

How to use http and https proxy together in scrapy?

Submitted by 淺唱寂寞╮ on 2021-02-08 07:56:37
Question: I am new to Scrapy. I found out how to use an HTTP proxy, but I want to use HTTP and HTTPS proxies together, because the links I crawl include both http and https URLs. How do I use an HTTP and an HTTPS proxy at the same time? class ProxyMiddleware(object): def process_request(self, request, spider): request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT" # like here request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT" proxy_user_pass = "USERNAME:PASSWORD" # setup basic authentication for the proxy encoded_user_pass =
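One way to serve both kinds of link from a single middleware is to pick the proxy endpoint from the request URL's scheme instead of hard-coding one. A sketch under the assumption that request exposes .url, .meta, and .headers as Scrapy's Request does; the proxy addresses and credentials are the question's placeholders:

```python
import base64
from urllib.parse import urlsplit

HTTP_PROXY = "http://YOUR_PROXY_IP:PORT"    # placeholder
HTTPS_PROXY = "https://YOUR_PROXY_IP:PORT"  # placeholder
PROXY_USER_PASS = "USERNAME:PASSWORD"       # placeholder

class SchemeProxyMiddleware:
    """Choose the proxy endpoint from the URL scheme instead of always
    assigning one, so http and https links each get a matching proxy."""

    def process_request(self, request, spider):
        scheme = urlsplit(request.url).scheme
        request.meta["proxy"] = HTTPS_PROXY if scheme == "https" else HTTP_PROXY
        # basic authentication for the proxy, as in the question
        token = base64.b64encode(PROXY_USER_PASS.encode()).decode()
        request.headers["Proxy-Authorization"] = "Basic " + token
```

Whether separate endpoints are needed at all depends on the provider: many proxies tunnel https via CONNECT through the same http endpoint, in which case a single proxy value works for both schemes.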