scrapy

Terminate Scrapy if a condition is met

大兔子大兔子 submitted on 2019-12-21 02:52:26
Question: I have written a scraper using Scrapy in Python. It contains 100 start_urls, and I want to terminate the scraping process once a condition is met, i.e. stop scraping when a particular div is found. By terminate I mean it should stop scraping all the URLs. Is this possible? Answer 1: What you're looking for is the CloseSpider exception. Add the following line somewhere at the top of your source file: from scrapy.exceptions import CloseSpider And when you detect that your termination condition is met,
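A minimal sketch of how CloseSpider is typically raised inside a parse callback; the spider name, URL and CSS selector below are placeholders, not from the original question:

    import scrapy
    from scrapy.exceptions import CloseSpider

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["https://example.com/page/1"]  # placeholder

        def parse(self, response):
            # Hypothetical condition: a particular div signals that we should stop.
            if response.css("div.stop-marker"):
                # Raising CloseSpider asks the engine to begin an orderly shutdown
                # of the whole spider, not just of this one URL.
                raise CloseSpider("termination condition met")
            yield {"title": response.css("title::text").get()}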

A Scrapy getting-started example

隐身守侯 submitted on 2019-12-20 23:36:04
1. Create a new project (scrapy startproject). Before you start crawling, you must create a new Scrapy project. Go into a directory of your choice and run the following command: scrapy startproject scrapyDemo Here scrapyDemo is the project name; a scrapyDemo folder will be created with roughly the following structure. The main files and their purposes: scrapy.cfg: the project's configuration file; scrapyDemo/: the project's Python module, from which the code is imported; scrapyDemo/items.py: the project's item definitions (the target data); scrapyDemo/pipelines.py: the project's pipelines; scrapyDemo/settings.py: the project's settings; scrapyDemo/spiders/: the directory that stores the spider code. The goal is to scrape the title, description, detail-page link and date of every post on https://www.cnblogs.com/loaderman/. Open items.py in the scrapyDemo directory. An Item defines structured data fields used to hold the scraped data; it behaves much like a Python dict but provides some extra protection against mistakes. You define an Item by creating a scrapy.Item subclass and declaring class attributes of type scrapy.Field (you can think of it as something like an ORM mapping). Next
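A minimal items.py sketch matching the fields described above (title, description, detail link, date); the class name is illustrative:

    import scrapy

    class ScrapyDemoItem(scrapy.Item):
        # Structured fields for each post scraped from the blog.
        title = scrapy.Field()        # post title
        description = scrapy.Field()  # post summary/description
        link = scrapy.Field()         # detail-page URL
        date = scrapy.Field()         # publication date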

Deploying Scrapy to a Scrapyd server

做~自己de王妃 submitted on 2019-12-20 23:10:58
1. Environment setup. 1.1 Install scrapyd: pip install scrapyd. Scrapyd is a service for running Scrapy spiders; it lets you deploy your Scrapy projects and control your spiders over an HTTP JSON API. Official documentation: http://scrapyd.readthedocs.org/ (The command-prompt and browser success screenshots are omitted here; to check that the service is running, open http://127.0.0.1:6800/.) 1.2 Install scrapyd-client: pip install scrapyd-client. scrapyd-client is the client side: it lets us package the local Scrapy project and upload it to the scrapyd server. 2. Deployment. First go to the root directory of the Scrapy project, which contains the scrapy.cfg file; this project is called demo. Edit scrapy.cfg: first uncomment the url line (the url is the address of the server we are deploying to), then change [deploy] to [deploy:scrapyd1]. Note that scrapyd1 is a name we defined ourselves; any name will do. On Windows, go to D:\ProgramData\Anaconda3\Scripts (wherever Python is installed) and create a file named scrapyd-deploy.bat there, whose content is (where the D:\ProgramData\Anaconda3
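A sketch of what the edited scrapy.cfg and the deploy step might look like, assuming the project is named demo and scrapyd is running on the local machine:

    # scrapy.cfg in the project root
    [settings]
    default = demo.settings

    [deploy:scrapyd1]
    url = http://127.0.0.1:6800/
    project = demo

    # then, from the project root, upload the project to the scrapyd server:
    #   scrapyd-deploy scrapyd1 -p demo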

Distributed crawling with scrapy-redis

ぃ、小莉子 submitted on 2019-12-20 19:56:45
1. Introduction. Out of the box, Scrapy's Scheduler maintains a per-machine task queue (holding Request objects, their callbacks and related information) plus a per-machine dedup queue (holding the URLs already visited). The key to distributed crawling is therefore to run a shared queue, such as Redis, on a dedicated host and to rewrite Scrapy's Scheduler so that the new Scheduler reads and writes Requests through that shared queue and filters out duplicate Requests. In short, distribution comes down to three points: 1) a shared queue; 2) a rewritten Scheduler that uses the shared queue for both scheduling and deduplication; 3) a dedup rule for the Scheduler built on Redis sets. Installation: pip3 install scrapy-redis (source installed under D:\python3.6\Lib\site-packages\scrapy_redis). 2. The scrapy-redis components (configuring filtering and scheduling in settings). 2.1 Using only scrapy-redis's deduplication. Source: D:\python3.6\Lib\site-packages\scrapy_redis\dupefilter.py. To make Scrapy use the shared Redis dedup queue, configure the Redis connection in settings.py: REDIS_HOST = 'localhost' # host REDIS_PORT = 6379 # port
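A minimal settings.py sketch for pointing Scrapy at the shared Redis queue; the dedup filter line is the scrapy-redis class referred to above, and the scheduler lines are the optional next step for fully shared scheduling:

    # settings.py
    REDIS_HOST = 'localhost'   # Redis host
    REDIS_PORT = 6379          # Redis port

    # Use the Redis-backed duplicate filter instead of Scrapy's default one.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Optional: also share the request queue across machines.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    SCHEDULER_PERSIST = True   # keep the queue and dedup set between runs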

Multi POST query (session mode)

巧了我就是萌 submitted on 2019-12-20 18:46:23
Question: I am trying to query this site to get the list of offers. The problem is that we need to fill in 2 forms (2 POST queries) before receiving the final result. This is what I have done so far. First I send the first POST after setting the cookies: library(httr) set_cookies(.cookies = c(a = "1", b = "2")) first_url <- "https://compare.switchon.vic.gov.au/submit" body <- list(energy_category="electricity", location="home", "location-home"="shift", "retailer-company"="", postcode="3000",
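For comparison, a hedged sketch of the same two-step flow in Python with requests.Session (the rough analogue of httr's cookie handling); the form fields are copied from the R snippet above, and the second POST is only indicated because its fields are not shown in the excerpt:

    import requests

    session = requests.Session()  # keeps cookies between the two POSTs
    first_url = "https://compare.switchon.vic.gov.au/submit"
    body = {
        "energy_category": "electricity",
        "location": "home",
        "location-home": "shift",
        "retailer-company": "",
        "postcode": "3000",
    }
    first_response = session.post(first_url, data=body)
    # The second form would be POSTed through the same session object so that
    # any cookies set by the first response are sent automatically.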

The Scrapy crawling framework

荒凉一梦 submitted on 2019-12-20 18:30:41
1. Introduction. Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping); it lets you extract the data you need from websites in a fast, simple and extensible way. Today Scrapy is used much more broadly: for data mining, monitoring and automated testing, for consuming data returned by APIs (such as Amazon Associates Web Services), and as a general-purpose web crawler. Scrapy is built on the Twisted framework, a popular event-driven Python networking framework, so Scrapy uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows. The data flow in Scrapy is controlled by the execution engine, and goes like this: The Engine gets the initial Requests to crawl from the Spider. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl. The Scheduler returns the next Requests to the Engine. The Engine sends the
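A minimal spider sketch showing where that data flow starts: the initial Requests the Engine gets from the Spider come from start_urls, and each downloaded Response is routed back to the parse callback. The site and selectors are from the public quotes.toscrape.com practice site, used here only as an illustration:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # These URLs become the initial Requests the Engine asks the Spider for.
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Each downloaded Response is fed back here by the Engine.
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}
            # Yielding a new Request sends it back through the Scheduler.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)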

ImportError: cannot import name '_win32stdio'

别来无恙 submitted on 2019-12-20 15:07:49
Question: I am working with the Scrapy framework to scrape data from a website, but I get the following error at the command prompt: ImportError: cannot import name '_win32stdio'. The traceback is attached as a screenshot; let me know if you need the directory structure of my program. Answer 1: Scrapy can work with Python 3 on Windows if you make some minor adjustments: copy _win32stdio and _pollingfile to the appropriate directory under site-packages, namely twisted-dir\internet. Download these from

Logging to specific error log file in scrapy

瘦欲@ submitted on 2019-12-20 13:15:49
Question: I am setting up logging for Scrapy by doing this: from scrapy import log class MySpider(BaseSpider): name = "myspider" def __init__(self, name=None, **kwargs): LOG_FILE = "logs/spider.log" log.log.defaultObserver = log.log.DefaultObserver() log.log.defaultObserver.start() log.started = False log.start(LOG_FILE, loglevel=log.INFO) super(MySpider, self).__init__(name, **kwargs) def parse(self,response): .... raise Exception("Something went wrong!") log.msg('Something went wrong!', log.ERROR) #
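The scrapy.log module used in the question is the old Twisted-based API and was removed in later Scrapy releases; a minimal sketch of the equivalent idea with current Scrapy, where the log file is configured through settings and errors go through the spider's built-in logger (the file path and spider name are illustrative):

    # settings.py (project settings)
    #   LOG_FILE = "logs/spider.log"   # send log output to this file
    #   LOG_LEVEL = "ERROR"            # record only ERROR and above

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def parse(self, response):
            try:
                ...  # scraping logic that may fail
            except Exception:
                # self.logger is a standard logging.Logger named after the spider,
                # so messages logged here respect LOG_FILE / LOG_LEVEL.
                self.logger.error("Something went wrong!", exc_info=True)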

Newbie: How to overcome Javascript “onclick” button to scrape web page?

天涯浪子 submitted on 2019-12-20 12:37:39
Question: This is the link I want to scrape: http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=MMFU_U The "English Version" tab at the upper right corner shows the English version of the page. There is a button I have to press before the fund information on the page becomes readable; otherwise the view is blocked, and scrapy shell always returns an empty []. <div onclick="AgreeClick()" style="width:200px; padding:8px; border:1px black solid;
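A common approach for this kind of disclaimer overlay is to inspect what the AgreeClick() JavaScript actually does (often it only sets a cookie and reloads the page) and replicate that in the request; a sketch under that assumption, where the cookie name and value are hypothetical and must be read from the page's own JavaScript:

    import scrapy

    class FundSpider(scrapy.Spider):
        name = "prufunds"

        def start_requests(self):
            url = ("http://www.prudential.com.hk/PruServlet"
                   "?module=fund&purpose=searchHistFund&fundCd=MMFU_U")
            # Hypothetical cookie: whatever AgreeClick() sets on the real page.
            yield scrapy.Request(url, cookies={"agree": "1"}, callback=self.parse)

        def parse(self, response):
            # With the disclaimer accepted, the fund table should be in the HTML.
            for row in response.css("table tr"):
                yield {"cells": row.css("td::text").getall()}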

Scrapy: storing the data

梦想的初衷 submitted on 2019-12-20 12:34:41
Question: I'm new to Python and Scrapy. I'm trying to follow the Scrapy tutorial but I don't understand the logic of the storage step: scrapy crawl spidername -o items.json -t json scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv I don't understand the meaning of -o, -t and --set. Thank you for your help. Answer 1: You can view a list of available options by typing scrapy crawl -h from within your project directory. scrapy crawl spidername -o items.json -t json -o specifies the
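A short annotated recap of the two commands quoted above; note that -t and the FEED_URI/FEED_FORMAT settings belong to older Scrapy versions (newer releases infer the format from the -o file extension or use the FEEDS setting):

    # -o : feed URI, i.e. the file the scraped items are written to
    # -t : feed format (json, csv, xml, ...)
    scrapy crawl spidername -o items.json -t json

    # --set : override any setting from the command line;
    # FEED_URI and FEED_FORMAT are the settings behind -o and -t
    scrapy crawl spidername --set FEED_URI=output.csv --set FEED_FORMAT=csv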