scrapy | 易学教程

How Distributed Scrapy Works

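Scrapy architecture

First, look at Scrapy's single-machine architecture. In this setup Scrapy maintains a crawl queue on the local machine, and the Scheduler dispatches requests from it. To make Scrapy distributed, several hosts have to cooperate, so what is the key to that cooperation? It is sharing the crawl queue: the queue is made accessible to every host, each host's Scheduler dispatches from it, and the hosts thereby share requests and crawl as one unit.

Single-host crawler architecture: the host takes requests from its own Queue, and the Scheduler dispatches them.
Distributed crawler architecture: multiple Schedulers dispatch from the same Queue, which lets the hosts crawl cooperatively.

Everything above depends on the queue, so how should the queue be maintained?

What maintains the queue? A database, files, or some purpose-built data structure may come to mind. The recommendation here is a Redis queue. What are the advantages of Redis?
Redis is a non-relational database that stores data in Key-Value form, so its structure is more flexible than that of other databases.
It is an in-memory data structure store, so it is fast and performs well.
It provides queues, sets, and other storage structures, which makes queue maintenance convenient.

How to deduplicate? Because several hosts access one queue during the crawl, they may request the same URL and therefore scrape the same data. How do we make sure the requests each host takes from the queue are not duplicated? Here we can use a Redis set.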
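The shared-queue idea described above can be sketched without a Redis server. The following is a minimal in-memory illustration, not how Scrapy or any Redis client actually implements it: the names SharedFrontier and crawl_round_robin, the host names, and the example URLs are all assumptions made for this sketch. In a real deployment the deque and the set would presumably be replaced by a Redis list and a Redis set that every host can reach.

```python
from collections import deque


class SharedFrontier:
    """In-memory stand-in for a crawl queue shared by several hosts (hypothetical)."""

    def __init__(self):
        self._queue = deque()  # plays the role of the shared request queue
        self._seen = set()     # plays the role of the shared deduplication set

    def push(self, url):
        """Enqueue url only if no host has submitted it before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Hand the next URL to whichever scheduler asks, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


def crawl_round_robin(frontier, hosts):
    """Let each host's scheduler take turns pulling from the one shared queue."""
    assigned = {host: [] for host in hosts}
    turn = 0
    while True:
        url = frontier.pop()
        if url is None:
            break
        assigned[hosts[turn % len(hosts)]].append(url)
        turn += 1
    return assigned


frontier = SharedFrontier()
for url in ["http://example.com/a", "http://example.com/b",
            "http://example.com/a", "http://example.com/c"]:
    frontier.push(url)  # the second "/a" is rejected by the dedup set

result = crawl_round_robin(frontier, ["host-1", "host-2"])
print(result)
```

Because every host pulls from the same queue and checks the same set, no URL is handed out twice, which is the property the text asks of the Redis-backed version.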

A First Look at Crawling with the Scrapy Framework: Douban Top 250 by Rating

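Environment
Installing Scrapy
Scrapy crawling steps
  Step 1: Create the project
    Create the Scrapy project
    Configure settings.py
    Create the spider file (douban_spider.py)
  Step 2: Define the target
    Open the site and analyze what to scrape
    Implement the data structure (items.py)
  Step 3: Build the spider
    Test
    Write the parsing code (the parse() method of douban_spider.py)
  Step 4: Save the data
    Save to a file
    Save to a database
Other topics: disguising the crawler
  Writing an IP proxy middleware (middlewares.py)
  Writing a user-agent middleware (middlewares.py)
Notes
References

Environment
win 10 + pycharm + python 3.6 + scrapy 3.2.3

Installing Scrapy
pip install scrapy

Scrapy crawling steps
Step 1: Create the project
Step 2: Define the target
Step 3: Build the spider
Step 4: Store the content

Step 1: Create the project

Create the Scrapy project
scrapy startproject douban

Configure settings.py
settings.py defines the project's global settings.
The robots protocol setting:
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
The default is ROBOTSTXT_OBEY = True, meaning the crawler obeys the protocol. If the content to be crawled is disallowed by the protocol and you still want to crawl it, set ROBOTSTXT_OBEY =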

Getting Started with Scrapy (1)

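Getting started with Scrapy

1. Terminal commands

Create a crawler project:
scrapy startproject spider_project_name  # custom project name

Create the spider source file in the spiders folder. This is where most of the crawler's functionality is implemented:
cd spider_project_name  # enter the project
scrapy genspider spider_name www.baidu.com
# spider_name: name of the new spider
# www.baidu.com: domain
# Rule-based crawler: scrapy genspider -t crawl xxx (spider name) xxx.com (domain to crawl)

Run commands: scrapy crawl spider_name, or scrapy crawl xxx -o xxx.json

2. The files, their configuration, and their roles

settings file
The project's configuration file. The places that need changing:
Line 19: USER_AGENT
Change the robots protocol setting to ROBOTSTXT_OBEY = False
Add the statements that control log output: LOG_LEVEL = 'ERROR' and LOG_FILE = 'log.txt'
Line 67: uncomment ITEM_PIPELINES to enable pipeline storage

items file
Item objects hold the scraped data, and their attributes must be defined in the items file.
For example, to crawl the title, streamer name, and popularity of each streamer on Huya Live:
title = scrapy.Field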
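The settings.py changes listed above might look roughly like the fragment below. This is only a sketch: the USER_AGENT string is a placeholder, and the pipeline path assumes the project was created as spider_project_name and that its pipeline class carries the name a fresh project template would typically generate. Both are assumptions to adjust to the actual project.

```python
# settings.py (fragment) -- a sketch of the edits described above

# Placeholder value (assumption): substitute a real browser user-agent string.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Only log errors, and write them to a file instead of the console.
LOG_LEVEL = "ERROR"
LOG_FILE = "log.txt"

# Enable the item pipeline; the dotted path is hypothetical and must match
# the class actually defined in the project's pipelines.py.
ITEM_PIPELINES = {
    "spider_project_name.pipelines.SpiderProjectNamePipeline": 300,
}
```

The number attached to each pipeline sets its order when several pipelines are enabled: lower values run first.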

Scrapy: spiders

Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#

In one sentence: a spider is where you define the crawling behavior (whether to follow new links) and how to analyze the page structure (extract data and return items).

1. scrapy.Spider
  1 name
  2 allowed_domains <-----------------------> OffsiteMiddleware
  3 start_urls <-----------------------> start_requests()
  4 custom_settings <-------------------------> Built-in settings reference
  It must be defined as a class attribute since the settings are updated before instantiation.

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/']
    custom_settings = { 'user
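The link between allowed_domains and offsite filtering can be illustrated with a simplified host check. The sketch below approximates the idea only and is not Scrapy's actual OffsiteMiddleware code. The helper name is_allowed is hypothetical. It shows why entries in allowed_domains should be bare domain names rather than full URLs.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


domains_ok = ["www.baidu.com"]           # bare domain name
domains_bad = ["https://www.baidu.com"]  # full URL: never matches a hostname

print(is_allowed("https://www.baidu.com/s?wd=scrapy", domains_ok))   # True
print(is_allowed("https://www.baidu.com/s?wd=scrapy", domains_bad))  # False
print(is_allowed("https://example.com/", domains_ok))                # False
```

With a full URL in the list, the comparison against the request's hostname can never succeed, so under this simplified model every request would be treated as offsite.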

Crawling Weibo Hot Search with scrapy

Install
pip install Scrapy

Create the project
scrapy startproject weiboHotSearch

Create the spider
cd weiboHotSearch
scrapy genspider weibo s.weibo.com

Write the Item
Edit items.py in weiboHotSearch and add the item:

import scrapy

class WeibohotsearchItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    keyword = scrapy.Field()
    url = scrapy.Field()
    count = scrapy.Field()

Write the spider
Modify start_urls; note that it must be a list.
Parse the data with xpath. For xpath syntax, see https://www.w3school.com.cn/xpath/xpath_syntax.asp
While parsing, you can run scrapy shell "https://s.weibo.com/top/summary" to debug xpath expressions.
Import the Item and return the data as Item objects.
Run scrapy crawl weibo to start the spider.

The results are as follows:
weibo.py
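The xpath step can be tried offline before touching the real page. The sketch below does not use Scrapy selectors. It swaps in the limited XPath subset of Python's standard xml.etree.ElementTree so that it runs without network access or extra packages. The markup, class names, and the parse_rows helper are invented for illustration and do not reflect the actual structure of the Weibo page.

```python
import xml.etree.ElementTree as ET

# Invented sample markup (assumption): the real page is structured differently.
SAMPLE = """
<table>
  <tr><td class="rank">1</td><td class="topic"><a href="/weibo?q=alpha">alpha</a><span>1200</span></td></tr>
  <tr><td class="rank">2</td><td class="topic"><a href="/weibo?q=beta">beta</a><span>800</span></td></tr>
</table>
"""


def parse_rows(markup, base="https://s.weibo.com"):
    """Extract keyword, url, and count from each topic cell of the sample table."""
    root = ET.fromstring(markup)
    items = []
    # ".//td[@class='topic']" selects every td element whose class is "topic".
    for cell in root.findall(".//td[@class='topic']"):
        link = cell.find("a")
        count = cell.find("span")
        items.append({
            "keyword": link.text,
            "url": base + link.get("href"),
            "count": int(count.text),
        })
    return items


items = parse_rows(SAMPLE)
for item in items:
    print(item)
```

The dictionary keys mirror the three fields of the item defined in items.py, so in a spider each dictionary would correspond to one item yielded from parse().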

day99 Crawlers: Introduction to scrapy and Its Structure

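Introduction to scrapy and its architecture (framework)

scrapy is the django of the crawler world.
A crawler framework is code that someone else has already written; afterwards you only need to write specific code in specific places.
Built on twisted: very high performance.

Five components
Engine: the general manager, in overall control of the data flow
Scheduler: deduplicates requests and adds them to the queue
Downloader: responsible for downloading and loading data
Spider: where most of your code goes; parses responses and issues new requests
Item pipeline: handles persistence

Two kinds of middleware
Spider middleware: between the spiders and the engine (rarely used)
Downloader middleware: between the engine and the downloader (add proxies, add cookies, modify the user-agent, integrate selenium)

Installing scrapy (windows)
mac/linux: pip3 install scrapy
windows:
pip3 install scrapy (works in most cases)
-If the above fails
-pip3 install wheel (for installing modules from xxx.whl files)
-Install pywin32, in one of two ways: 1 pip3 install pywin32 2 download an exe installer from https://sourceforge.net/projects/pywin32/files/pywin32/
-Download the twisted wheel file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted; the download is an xxx.whl file
-Run pip3 install <download directory>\Twisted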
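As an illustration of the downloader middleware role described above (modifying the user-agent between the engine and the downloader), here is a minimal sketch. The class name RandomUserAgentMiddleware and the USER_AGENTS list are hypothetical, and the stand-in request object exists only so the example runs without Scrapy installed. In a real project the class would go in middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting.

```python
import random
from types import SimpleNamespace

# Hypothetical pool of user-agent strings (assumption): replace with real ones.
USER_AGENTS = [
    "Mozilla/5.0 (example-agent-1)",
    "Mozilla/5.0 (example-agent-2)",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for every request on its way to the downloader.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets the request continue through the remaining middlewares.
        return None


# Stand-in for a request: only the headers mapping is modelled here.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

A proxy or cookie middleware would follow the same shape, altering a different attribute of the request in process_request.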

Getting TCP connection timed out: 110: Connection timed out. on AWS while using scrapy?

Question

This is my scrapy code.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
import pymongo
import time

class CompItem(scrapy.Item):
    text = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    title = scrapy.Field()
    category = scrapy.Field()
    source = scrapy.Field()
    user_info = scrapy.Field()
    email =

Python Crawlers (4. The scrapy Framework, Official Documentation, and Examples)

Official documentation: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy

I'll organize the rest in a bit... off to get some food... --2014-08-20 19:29:20

Ugh... my Sogou input method just broke, so I logged out and back in, and everything I had written was gone. It seems the drafts box isn't all that reliable either!!! Time to put it together again. -- 2014-08-21 04:02:37

(I) Basics -- scrapy.spider.Spider

(1) Using the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09
