scrapy

Scrapy XPath all the links on the page

瘦欲@ Submitted on 2019-12-21 05:17:16
Question: I am trying to collect all the URLs under a domain using Scrapy. I was trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as key-value pairs. Key: the current URL; Value: all the links on that page.

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    rules = (Rule(SgmlLinkExtractor()), )

    def parse_item(self, response
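A minimal sketch of one way to complete this, assuming a modern Scrapy where LinkExtractor replaces the deprecated SgmlLinkExtractor; abc.com is the question's placeholder domain:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AllLinksSpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    # follow every link and hand each downloaded page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # key: the current URL; value: every href found on this page
        yield {
            'url': response.url,
            'links': response.xpath('//a/@href').getall(),
        }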

scrapy-redis

给你一囗甜甜゛ Submitted on 2019-12-21 05:11:43
The difference between Scrapy and scrapy-redis: Scrapy is a general-purpose crawler framework, but it does not support distributed crawling. scrapy-redis provides a set of Redis-based components (components only) to make distributed crawling with Scrapy easier. pip install scrapy-redis. scrapy-redis provides the following four components (which means these four modules all need corresponding modifications): Scheduler, Duplication Filter, Item Pipeline, Base Spider. scrapy-redis architecture: as shown in the figure above, scrapy-redis adds Redis on top of Scrapy's architecture and, building on Redis's features, extends the following components. Scheduler: Scrapy adapted Python's built-in collections.deque (double-ended queue) into its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share one pending-request queue, i.e. Scrapy itself does not support distributed crawling. scrapy-redis solves this by replacing the Scrapy queue with a Redis database (that is, a Redis queue), so that requests are taken from the same Redis
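As a rough sketch of how these components are wired in, scrapy-redis is normally enabled through settings.py; the class paths below are the ones scrapy-redis documents, while the Redis URL is a placeholder:

# settings.py: minimal scrapy-redis wiring (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # Redis-backed shared request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedup set in Redis
SCHEDULER_PERSIST = True                                     # keep queue and dedup set across restarts
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,             # push scraped items into Redis
}
REDIS_URL = "redis://localhost:6379"                         # placeholder; point at your Redis server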

scrapy how spider returns value to another spider

不羁的心 Submitted on 2019-12-21 05:11:13
Question: The website that I am crawling contains many players, and when I click on any player, I go to his page. The website structure is like this:

<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>

And when I click on any link, I go to the player's page, which is like this:

<player name>
<player team>
<player age>
<player salary>
<player date>

I want to scrape all the players whose age is between 20 and 25 years. What I am doing is scraping the
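A minimal sketch of the usual pattern, with hypothetical XPath selectors since the real markup is not shown: the main-page callback follows every player link, and the per-player callback filters on age before yielding the item.

import scrapy

class PlayersSpider(scrapy.Spider):
    name = 'players'
    start_urls = ['http://example.com/players']  # placeholder URL

    def parse(self, response):
        # follow each player link found on the main page
        for href in response.xpath('//a[@class="player"]/@href').getall():
            yield response.follow(href, callback=self.parse_player)

    def parse_player(self, response):
        # selectors are illustrative; adjust them to the real page
        age = int(response.xpath('//span[@class="age"]/text()').get('0'))
        if 20 <= age <= 25:
            yield {
                'name': response.xpath('//h1/text()').get(),
                'team': response.xpath('//span[@class="team"]/text()').get(),
                'age': age,
            }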

Cannot install lxml on windows, fatal error C1083: Cannot open include file: 'libxml/xmlversion.h'

﹥>﹥吖頭↗ Submitted on 2019-12-21 05:08:28
Question: Python noob, please bear with me. I used the Python installer for v3.5.1 from www.python.org. My intent was to use Scrapy to run some scripts. pip install scrapy failed, as did easy_install scrapy and others. I traced the error to a faulty install of lxml. Here is the error log. I've even tried easy_install-ing libxml2; I'm not sure how to proceed. Building lxml version 3.5.0. Building without Cython. ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program

A question about re-issuing requests in a Scrapy crawler

我们两清 Submitted on 2019-12-21 05:07:32
A question about re-issuing requests in a Scrapy crawler. Interpreter and library versions: Python 3.7, Scrapy 1.6.0. Question: when fetching a response, an abnormal response may come back; in that case, how can the request be re-issued through a middleware? Source: CSDN Author: qq_42553453 Link: https://blog.csdn.net/qq_42553453/article/details/103588419
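The linked post is not included in the excerpt, so here is only a sketch of the standard approach: a downloader middleware whose process_response returns a fresh Request when the response looks abnormal, which makes Scrapy re-schedule it. (Scrapy's built-in RetryMiddleware, driven by the RETRY_TIMES and RETRY_HTTP_CODES settings, already covers the common cases.)

# middlewares.py: sketch of a retry-on-bad-response downloader middleware
class RetryBadResponseMiddleware:
    MAX_RETRIES = 3  # illustrative cap

    def process_response(self, request, response, spider):
        # treat 5xx statuses as "abnormal" here; adjust the test as needed
        if response.status >= 500:
            retries = request.meta.get('retry_times', 0)
            if retries < self.MAX_RETRIES:
                # returning a Request instead of the Response re-schedules it;
                # dont_filter=True bypasses the duplicate filter
                new_request = request.replace(dont_filter=True)
                new_request.meta['retry_times'] = retries + 1
                return new_request
        return response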

How to create a pg_trgm index using SQLAlchemy for Scrapy?

筅森魡賤 Submitted on 2019-12-21 05:02:07
Question: I am using Scrapy to scrape data from a web forum. I am storing this data in a PostgreSQL database using SQLAlchemy. The table and columns are created fine; however, I am not able to get SQLAlchemy to create an index on one of the columns. I am trying to create a trigram index (pg_trgm) using gin. The PostgreSQL code that would create this index is: CREATE INDEX description_idx ON table USING gin (description gin_trgm_ops); The SQLAlchemy code I have added to my models.py file is: desc_idx = Index(
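A sketch of how such an index is usually declared through SQLAlchemy's PostgreSQL dialect options (model and column names are illustrative, and the pg_trgm extension must already be enabled in the database, e.g. via CREATE EXTENSION pg_trgm):

from sqlalchemy import Column, Index, Integer, Text
from sqlalchemy.orm import declarative_base  # sqlalchemy.ext.declarative on older versions

Base = declarative_base()

class ForumPost(Base):
    __tablename__ = 'forum_post'  # illustrative table name
    id = Column(Integer, primary_key=True)
    description = Column(Text)

    __table_args__ = (
        # emits: CREATE INDEX description_idx ON forum_post
        #        USING gin (description gin_trgm_ops)
        Index(
            'description_idx',
            'description',
            postgresql_using='gin',
            postgresql_ops={'description': 'gin_trgm_ops'},
        ),
    )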

scrapy djangoitem with Foreign Key

孤人 Submitted on 2019-12-21 05:00:14
Question: This question was asked here: Foreign Keys on Scrapy, without an accepted answer, so I am re-raising it with a more clearly defined minimal setup. The Django model:

class Article(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    category = models.ForeignKey('categories.Category', null=True, blank=True)

Note that how category is defined is irrelevant here, but it does use ForeignKey. So, in the Django shell, this would work: c = Article(title="foo", content
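A sketch of the pattern that usually works with scrapy-djangoitem, under the assumption that the ForeignKey field must be declared on the item explicitly and filled with a model instance (not a raw id or name); the assignment typically happens in a pipeline, where database access is safe:

import scrapy
from scrapy_djangoitem import DjangoItem
from categories.models import Category  # app path taken from the question
from myapp.models import Article        # hypothetical app path

class ArticleItem(DjangoItem):
    django_model = Article
    # declare the relation by hand in case DjangoItem does not pick it up
    category = scrapy.Field()

# e.g. inside an item pipeline's process_item:
def attach_category(item):
    # ForeignKey fields expect a model instance, not a string or id
    category, _ = Category.objects.get_or_create(name='news')  # illustrative lookup
    item['category'] = category
    item.save()
    return item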

Scrapy:In a request fails (eg 404,500), how to ask for another alternative request?

大城市里の小女人 Submitted on 2019-12-21 04:33:07
Question: I have a problem with Scrapy. If a request fails (e.g. 404, 500), how can I make an alternative request? For example, two links can both provide the price info; if one fails, request the other automatically. Answer 1: Use "errback" in the Request, like errback=self.error_handler, where error_handler is a function (just like a callback function). In this function, check the error code and make the alternative Request. See errback in the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/request-response
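A minimal sketch of that answer, with placeholder URLs (non-2xx responses reach the errback through Scrapy's HttpError middleware, as do network-level failures):

import scrapy

class PriceSpider(scrapy.Spider):
    name = 'price'

    def start_requests(self):
        # try the primary source first; fall back on any error
        yield scrapy.Request(
            'http://example.com/price-source-1',  # placeholder
            callback=self.parse_price,
            errback=self.error_handler,
        )

    def error_handler(self, failure):
        # issue the alternative request when the first one fails
        yield scrapy.Request(
            'http://example.com/price-source-2',  # placeholder fallback
            callback=self.parse_price,
        )

    def parse_price(self, response):
        yield {'price': response.xpath('//span[@id="price"]/text()').get()}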

Learning the Scrapy crawler framework

▼魔方 西西 Submitted on 2019-12-21 04:20:41
Contents: I. Setting up the environment: 1. Installing the packages: (1) twisted (2) scrapy. II. Creating a project. III. Hands-on: 1. Create the project 2. Create the spider 3. Open the project 4. Define the fields 5. Write the spider file 6. Process the data 7. Change the settings 8. Run the program 9. Pagination 10. Save the data to MySQL.

I. Setting up the environment: 1. Installing the packages: (1) twisted: although installing scrapy installs twisted automatically, the automatic install can be incomplete, so it is better to install it yourself first. Download a twisted wheel matching your Python version and OS. To install it, open a DOS prompt in the directory containing the twisted package and run: pip install <filename>.whl (2) scrapy: run cmd as administrator and execute: pip install scrapy. Seeing "Successfully installed" means the installation succeeded.

II. Creating a project: scrapy startproject <project name>, e.g. scrapy startproject SearchSpider. Then go into the spiders directory of the project, e.g. \\SearchSpider\SearchSpider\spiders, and run: scrapy genspider <spider name> "<domain to crawl>", e.g. scrapy genspider search "baidu.com"

III. Hands-on: we will crawl the new-stock data from NetEase Finance: http://quotes
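As a sketch of what steps 4 and 5 typically produce (the field names and XPath selectors are illustrative, since the excerpt cuts off before the tutorial's actual code):

# items.py: step 4, define the fields
import scrapy

class StockItem(scrapy.Item):
    code = scrapy.Field()  # illustrative field
    name = scrapy.Field()  # illustrative field

# spiders/search.py: step 5, fill in the spider generated by "scrapy genspider"
class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # one item per table row; the selectors are placeholders
        for row in response.xpath('//table//tr'):
            item = StockItem()
            item['code'] = row.xpath('./td[1]/text()').get()
            item['name'] = row.xpath('./td[2]/text()').get()
            yield item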

Pitfalls I hit while installing the Python crawler Scrapy, and some thoughts beyond programming

99封情书 Submitted on 2019-12-21 03:47:21
Introduction: Elastalert is an alerting framework written in Python 2 (it currently supports Python 2.6 and 2.7, not 3.x); the GitHub address is https://github.com/Yelp/elastalert. It provides rule configurations for different scenarios, and if the built-in rules and alerters do not meet your needs, you can write plugins in Python (Adding a New Rule Type, Adding a New Alerter). Environment: OS: CentOS 6.8; Python: 2.7.12 (see upgrading the default Python on CentOS 6 to 2.7.12); Elasticsearch: 5.5; Kibana: 5.5. Elastalert's built-in alerting methods: Email, JIRA, OpsGenie, Commands, HipChat, MS Teams, Slack, Telegram, AWS SNS, VictorOps, PagerDuty, Exotel, Twilio, Gitter. Installation: installing Elastalert with pip. Install the pip package manager (see reference). $ pip install elastalert Or git clone (recommended): $ git clone https://github.com/Yelp/elastalert.git Install the modules: $ pip install "setuptools>=11.3" $ python setup.py