scrapy

Scrapy XPath all the links on the page

瘦欲@ Submitted on 2019-12-21 05:17:16
Question: I am trying to collect all the URLs under a domain using Scrapy. I was trying to use CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data as key-value pairs. Key: the current URL; Value: all the links on that page.

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    rules = (Rule(SgmlLinkExtractor()), )

    def parse_item(self, response
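A minimal sketch of one way to complete this, assuming a modern Scrapy where LinkExtractor replaces the deprecated SgmlLinkExtractor; abc.com is the question's placeholder domain:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AllLinksSpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']
    # follow every link and hand each downloaded page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # key: the current URL; value: every href found on this page
        yield {
            'url': response.url,
            'links': response.xpath('//a/@href').getall(),
        }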

scrapy-redis

给你一囗甜甜゛ Submitted on 2019-12-21 05:11:43
The difference between Scrapy and scrapy-redis: Scrapy is a general-purpose crawler framework, but it does not support distributed crawling. scrapy-redis provides a set of Redis-based components (components only) to make distributed crawling with Scrapy easier. pip install scrapy-redis. scrapy-redis provides the following four components (which means these four modules all need corresponding modifications): Scheduler, Duplication Filter, Item Pipeline, Base Spider. scrapy-redis architecture: as shown in the figure above, scrapy-redis adds Redis on top of Scrapy's architecture and, building on Redis's features, extends the following components. Scheduler: Scrapy adapted Python's built-in collections.deque (double-ended queue) into its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share one pending-request queue, i.e. Scrapy itself does not support distributed crawling. scrapy-redis solves this by replacing the Scrapy queue with a Redis database (that is, a Redis queue), so that requests are taken from the same Redis
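As a rough sketch of how these components are wired in, scrapy-redis is normally enabled through settings.py; the class paths below are the ones scrapy-redis documents, while the Redis URL is a placeholder:

# settings.py: minimal scrapy-redis wiring (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # Redis-backed shared request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedup set in Redis
SCHEDULER_PERSIST = True                                     # keep queue and dedup set across restarts
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,             # push scraped items into Redis
}
REDIS_URL = "redis://localhost:6379"                         # placeholder; point at your Redis server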

scrapy how spider returns value to another spider

不羁的心 Submitted on 2019-12-21 05:11:13
Question: The website that I am crawling contains many players, and when I click on any player, I go to his page. The website structure is like this:

<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>

And when I click on any link, I go to the player's page, which is like this:

<player name>
<player team>
<player age>
<player salary>
<player date>

I want to scrape all the players whose age is between 20 and 25 years. What I am doing is scraping the
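A minimal sketch of the usual pattern, with hypothetical XPath selectors since the real markup is not shown: the main-page callback follows every player link, and the per-player callback filters on age before yielding the item.

import scrapy

class PlayersSpider(scrapy.Spider):
    name = 'players'
    start_urls = ['http://example.com/players']  # placeholder URL

    def parse(self, response):
        # follow each player link found on the main page
        for href in response.xpath('//a[@class="player"]/@href').getall():
            yield response.follow(href, callback=self.parse_player)

    def parse_player(self, response):
        # selectors are illustrative; adjust them to the real page
        age = int(response.xpath('//span[@class="age"]/text()').get('0'))
        if 20 <= age <= 25:
            yield {
                'name': response.xpath('//h1/text()').get(),
                'team': response.xpath('//span[@class="team"]/text()').get(),
                'age': age,
            }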

Cannot install lxml on windows, fatal error C1083: Cannot open include file: 'libxml/xmlversion.h'

﹥>﹥吖頭↗ Submitted on 2019-12-21 05:08:28
Question: Python noob, please bear with me. I used the Python installer for v3.5.1 from www.python.org. My intent was to use Scrapy to run some scripts. pip install scrapy failed, as did easy_install scrapy and others. I traced the error to a faulty install of lxml. Here is the error log. I've even tried easy_install-ing libxml2; I'm not sure how to proceed. Building lxml version 3.5.0. Building without Cython. ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program

A question about re-issuing requests in a Scrapy crawler

我们两清 Submitted on 2019-12-21 05:07:32
A question about re-issuing requests in a Scrapy crawler. Interpreter and library versions: Python 3.7, Scrapy 1.6.0. Question: when fetching a response, an abnormal response may come back; in that case, how can the request be re-issued through a middleware? Source: CSDN Author: qq_42553453 Link: https://blog.csdn.net/qq_42553453/article/details/103588419
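The linked post is not included in the excerpt, so here is only a sketch of the standard approach: a downloader middleware whose process_response returns a fresh Request when the response looks abnormal, which makes Scrapy re-schedule it. (Scrapy's built-in RetryMiddleware, driven by the RETRY_TIMES and RETRY_HTTP_CODES settings, already covers the common cases.)

# middlewares.py: sketch of a retry-on-bad-response downloader middleware
class RetryBadResponseMiddleware:
    MAX_RETRIES = 3  # illustrative cap

    def process_response(self, request, response, spider):
        # treat 5xx statuses as "abnormal" here; adjust the test as needed
        if response.status >= 500:
            retries = request.meta.get('retry_times', 0)
            if retries < self.MAX_RETRIES:
                # returning a Request instead of the Response re-schedules it;
                # dont_filter=True bypasses the duplicate filter
                new_request = request.replace(dont_filter=True)
                new_request.meta['retry_times'] = retries + 1
                return new_request
        return response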

How to create a pg_trgm index using SQLAlchemy for Scrapy?

筅森魡賤 Submitted on 2019-12-21 05:02:07
Question: I am using Scrapy to scrape data from a web forum. I am storing this data in a PostgreSQL database using SQLAlchemy. The table and columns are created fine; however, I am not able to get SQLAlchemy to create an index on one of the columns. I am trying to create a trigram index (pg_trgm) using gin. The PostgreSQL code that would create this index is: CREATE INDEX description_idx ON table USING gin (description gin_trgm_ops); The SQLAlchemy code I have added to my models.py file is: desc_idx = Index(
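A sketch of how such an index is usually declared through SQLAlchemy's PostgreSQL dialect options (model and column names are illustrative, and the pg_trgm extension must already be enabled in the database, e.g. via CREATE EXTENSION pg_trgm):

from sqlalchemy import Column, Index, Integer, Text
from sqlalchemy.orm import declarative_base  # sqlalchemy.ext.declarative on older versions

Base = declarative_base()

class ForumPost(Base):
    __tablename__ = 'forum_post'  # illustrative table name
    id = Column(Integer, primary_key=True)
    description = Column(Text)

    __table_args__ = (
        # emits: CREATE INDEX description_idx ON forum_post
        #        USING gin (description gin_trgm_ops)
        Index(
            'description_idx',
            'description',
            postgresql_using='gin',
            postgresql_ops={'description': 'gin_trgm_ops'},
        ),
    )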

scrapy djangoitem with Foreign Key

孤人 Submitted on 2019-12-21 05:00:14
Question: This question was asked here: Foreign Keys on Scrapy, without an accepted answer, so I am re-raising it with a more clearly defined minimal setup. The Django model:

class Article(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    category = models.ForeignKey('categories.Category', null=True, blank=True)

Note that how category is defined is irrelevant here, but it does use ForeignKey. So, in the Django shell, this would work: c = Article(title="foo", content
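A sketch of the pattern that usually works with scrapy-djangoitem, under the assumption that the ForeignKey field must be declared on the item explicitly and filled with a model instance (not a raw id or name); the assignment typically happens in a pipeline, where database access is safe:

import scrapy
from scrapy_djangoitem import DjangoItem
from categories.models import Category  # app path taken from the question
from myapp.models import Article        # hypothetical app path

class ArticleItem(DjangoItem):
    django_model = Article
    # declare the relation by hand in case DjangoItem does not pick it up
    category = scrapy.Field()

# e.g. inside an item pipeline's process_item:
def attach_category(item):
    # ForeignKey fields expect a model instance, not a string or id
    category, _ = Category.objects.get_or_create(name='news')  # illustrative lookup
    item['category'] = category
    item.save()
    return item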

Scrapy:In a request fails (eg 404,500), how to ask for another alternative request?

大城市里の小女人 Submitted on 2019-12-21 04:33:07
Question: I have a problem with Scrapy. If a request fails (e.g. 404, 500), how can I make an alternative request? For example, two links can both provide the price info; if one fails, request the other automatically. Answer 1: Use "errback" in the Request, like errback=self.error_handler, where error_handler is a function (just like a callback function). In this function, check the error code and make the alternative Request. See errback in the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/request-response
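A minimal sketch of that answer, with placeholder URLs (non-2xx responses reach the errback through Scrapy's HttpError middleware, as do network-level failures):

import scrapy

class PriceSpider(scrapy.Spider):
    name = 'price'

    def start_requests(self):
        # try the primary source first; fall back on any error
        yield scrapy.Request(
            'http://example.com/price-source-1',  # placeholder
            callback=self.parse_price,
            errback=self.error_handler,
        )

    def error_handler(self, failure):
        # issue the alternative request when the first one fails
        yield scrapy.Request(
            'http://example.com/price-source-2',  # placeholder fallback
            callback=self.parse_price,
        )

    def parse_price(self, response):
        yield {'price': response.xpath('//span[@id="price"]/text()').get()}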

Learning the Scrapy crawler framework

▼魔方 西西 Submitted on 2019-12-21 04:20:41
Contents: I. Setting up the environment: 1. Installing the packages: (1) twisted (2) scrapy. II. Creating a project. III. Hands-on: 1. Create the project 2. Create the spider 3. Open the project 4. Define the fields 5. Write the spider file 6. Process the data 7. Change the settings 8. Run the program 9. Pagination 10. Save the data to MySQL.

I. Setting up the environment: 1. Installing the packages: (1) twisted: although installing scrapy installs twisted automatically, the automatic install can be incomplete, so it is better to install it yourself first. Download a twisted wheel matching your Python version and OS. To install it, open a DOS prompt in the directory containing the twisted package and run: pip install <filename>.whl (2) scrapy: run cmd as administrator and execute: pip install scrapy. Seeing "Successfully installed" means the installation succeeded.

II. Creating a project: scrapy startproject <project name>, e.g. scrapy startproject SearchSpider. Then go into the spiders directory of the project, e.g. \\SearchSpider\SearchSpider\spiders, and run: scrapy genspider <spider name> "<domain to crawl>", e.g. scrapy genspider search "baidu.com"

III. Hands-on: we will crawl the new-stock data from NetEase Finance: http://quotes
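As a sketch of what steps 4 and 5 typically produce (the field names and XPath selectors are illustrative, since the excerpt cuts off before the tutorial's actual code):

# items.py: step 4, define the fields
import scrapy

class StockItem(scrapy.Item):
    code = scrapy.Field()  # illustrative field
    name = scrapy.Field()  # illustrative field

# spiders/search.py: step 5, fill in the spider generated by "scrapy genspider"
class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # one item per table row; the selectors are placeholders
        for row in response.xpath('//table//tr'):
            item = StockItem()
            item['code'] = row.xpath('./td[1]/text()').get()
            item['name'] = row.xpath('./td[2]/text()').get()
            yield item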

Pitfalls I hit while installing the Python crawler Scrapy, and some thoughts beyond programming

99封情书 Submitted on 2019-12-21 03:47:21
Introduction: Elastalert is an alerting framework written in Python 2 (it currently supports Python 2.6 and 2.7, not 3.x); the GitHub address is https://github.com/Yelp/elastalert. It provides rule configurations for different scenarios, and if the built-in rules and alerters do not meet your needs, you can write plugins in Python (Adding a New Rule Type, Adding a New Alerter). Environment: OS: CentOS 6.8; Python: 2.7.12 (see upgrading the default Python on CentOS 6 to 2.7.12); Elasticsearch: 5.5; Kibana: 5.5. Elastalert's built-in alerting methods: Email, JIRA, OpsGenie, Commands, HipChat, MS Teams, Slack, Telegram, AWS SNS, VictorOps, PagerDuty, Exotel, Twilio, Gitter. Installation: installing Elastalert with pip. Install the pip package manager (see reference). $ pip install elastalert Or git clone (recommended): $ git clone https://github.com/Yelp/elastalert.git Install the modules: $ pip install "setuptools>=11.3" $ python setup.py