Installing Scrapy:
Windows:
a. pip3 install wheel
b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
c. Go to the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
d. pip3 install pywin32
e. pip3 install scrapy
I. Basic usage of Scrapy
1. Create a project: scrapy startproject firstBlood
2. Change into the project directory: cd proName
3. Create a spider file: scrapy genspider first www.example.com (first is the spider file name; www.example.com is the starting URL to crawl)
4. Run the project: scrapy crawl spiderName
The generated spider file, explained:
# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    # unique identifier of the spider file
    name = 'first'
    # allowed domains, used to restrict which sites may be crawled
    # allowed_domains = ['www.example.com']
    # starting URL list: may only contain URLs
    # every URL in this list will be requested by Scrapy
    start_urls = ['http://baidu.com/', 'http://www.sogou.com']

    # used for data parsing
    def parse(self, response):
        print(response)
Three things are usually changed in the settings.py configuration file:
# 1. robots protocol: do not obey robots.txt
ROBOTSTXT_OBEY = False
# 2. UA spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
# 3. set the log level
LOG_LEVEL = 'ERROR'
II. Pipeline-based persistent storage
1. Parse the data in the spider file
2. Package the parsed data into an Item object
3. Submit the Item object to the pipeline
4. Persist the data in any form inside the pipeline
5. Enable the pipeline in the settings file
2.1 Crawl the title and content of each post under jandan.net's design tag and persist them
Spider file code:
# -*- coding: utf-8 -*-
import scrapy
from JiandanPro.items import JiandanproItem

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://jandan.net/tag/设计']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div')
        for div in div_list:
            title = div.xpath('./div/h2/a/text()').extract_first()
            content = div.xpath('.//div[@class="indexs"]/text()').extract()
            content = ''.join(content)
            if title and content:
                item = JiandanproItem()
                item['title'] = title
                item['content'] = content
                yield item  # submit the item to the pipeline
items.py code:
import scrapy

class JiandanproItem(scrapy.Item):
    # define the fields for your item here like:
    # Field is a catch-all data type
    title = scrapy.Field()
    content = scrapy.Field()
pipelines.py code:
class JiandanproPipeline(object):
    fp = None

    # override a method of the parent class
    def open_spider(self, spider):
        self.fp = open('./data.txt', 'w', encoding='utf-8')
        print('i am open_spider, I am called only once!')

    # receives the item and persists it in any form
    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.fp.write(title + ':' + content + '\n')
        return item  # pass the item on to the next pipeline class to be executed

    def close_spider(self, spider):
        self.fp.close()
        print('i am close_spider, I am called only once!')
settings configuration:
ITEM_PIPELINES = {
'JiandanPro.pipelines.JiandanproPipeline': 300,
}
2.2 Backing up the data to MySQL
1. Each pipeline class is responsible for writing data to one storage platform
2. The item yielded by the spider file is only delivered to the pipeline class with the highest priority
3. How do we make every pipeline class receive the item?
   Simply return the item from the process_item method
Create the database and table:
create database spider;
use spider;
create table jiandan(title varchar(300), content varchar(500));
Pipeline code:
import pymysql

class MysqlPipeline(object):
    conn = None    # connection object
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='wang', db='spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        sql = 'insert into jiandan values ("%s","%s")' % (title, content)
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Persisting the data to Redis:
from redis import Redis

class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # writing a dict to Redis raises an error on newer redis-py versions
        # pip install -U redis==2.10.6
        self.conn.lpush('dataList', item)
        print(item)
        return item
Note: a pipeline class only takes effect after it is registered in settings.
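A minimal registration sketch covering all three pipeline classes from this section. The lower the number, the earlier the pipeline runs; the exact priority values 301 and 302 are only an assumption here, not taken from the original project:

ITEM_PIPELINES = {
    'JiandanPro.pipelines.JiandanproPipeline': 300,
    'JiandanPro.pipelines.MysqlPipeline': 301,
    'JiandanPro.pipelines.RedisPipeline': 302,
}

Because each process_item returns the item, it is handed down this list in priority order, so all three platforms receive the data.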
III. Manual request sending and full-site crawling
yield scrapy.Request(url, callback)  # sends a GET request
yield scrapy.FormRequest(url, formdata=..., callback=...)  # sends a POST request
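The case study below only uses GET requests, so here is a minimal POST sketch with scrapy.FormRequest. The target URL (https://httpbin.org/post) and the 'kw' form field are placeholder assumptions for illustration, not part of the original notes:

import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'postDemo'
    # placeholder URL: httpbin.org echoes the POSTed form back as JSON
    start_urls = ['https://httpbin.org/post']

    # override start_requests so the start URL is sent as a POST instead of the default GET
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url, formdata={'kw': 'scrapy'}, callback=self.parse)

    def parse(self, response):
        print(response.text)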
Example: crawl the first six pages of http://wz.sun0769.com/index.php/question/questionType?page=
import scrapy

class SunSpider(scrapy.Spider):
    name = 'sun'
    # allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30

    def parse(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            print(title)
        if self.page <= 150:
            new_url = self.url % self.page
            self.page += 30
            # send the request manually
            yield scrapy.Request(new_url, callback=self.parse)
IV. The five core components
1. Engine:
   Handles the data flow of the whole system and triggers events (the core of the framework).
2. Scheduler:
   Accepts requests sent over by the engine, puts them into a queue, and returns them when the engine asks for them again.
   - filter (deduplication)
   - queue
3. Downloader:
   Downloads page content and returns it; the downloader is built on top of Twisted, an efficient asynchronous model.
4. Spider:
   Extracts the information you need from specific pages.
5. Pipeline:
   Processes the entities (items) the spider extracts from pages; its main function is persistent storage.
V. Passing parameters with requests
Purpose: lets Scrapy perform deep crawling
Deep crawling: the data to be scraped is not all stored on the same page
- pass a dict through the meta argument of scrapy.Request(url, callback, meta)
- in the callback, receive that dict via response.meta
Spider code:
# -*- coding: utf-8 -*-
import scrapy
from RequestSendPro.items import RequestsendproItem

class ParamdemoSpider(scrapy.Spider):
    name = 'paramDemo'
    # allowed_domains = ['www.xx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?page=']
    url = 'http://wz.sun0769.com/index.php/question/questionType?page=%d'
    page = 30

    def parse(self, response):
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            item = RequestsendproItem()
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            item['title'] = title
            detail_url = tr.xpath('./td[2]/a[2]/@href').extract_first()
            # send a request to the detail page URL
            # the meta dict is passed along to the callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
        if self.page <= 150:
            new_url = self.url % self.page
            self.page += 30
            yield scrapy.Request(new_url, callback=self.parse)

    # parse the article content
    def parse_detail(self, response):
        # receive meta
        item = response.meta['item']
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td//text()').extract()
        content = ''.join(content)
        item['content'] = content
        yield item
VI. Middleware
Types:
1. Downloader middleware
2. Spider middleware
Purpose: intercept requests and responses in bulk
Why intercept requests:
- to set a proxy
  process_exception():
      request.meta['proxy'] = 'http://ip:port'
- to tamper with request header information (UA)
      request.headers['User-Agent'] = 'xxx'
middlewares.py code:

# -*- coding: utf-8 -*-
from scrapy import signals
import random
user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
"(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
class MiddleproDownloaderMiddleware(object):
    # intercept requests
    # spider: the instantiated spider object
    def process_request(self, request, spider):
        print('I am process_request')
        # UA spoofing based on the UA pool
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # proxy
        # request.meta['proxy'] = 'https://58.246.228.218:1080'
        return None

    # intercept all responses
    def process_response(self, request, response, spider):
        return response

    # intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        print('I am process_exception')
        # proxy
        request.meta['proxy'] = 'https://58.246.228.218:1080'
        # resend the corrected request object
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Spider file code:

import scrapy

class MiddleSpider(scrapy.Spider):
    name = 'middle'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        ips = response.xpath('//*[@id="1"]/div[1]/div[1]/div[2]/table//tr/td/span').extract_first()
        print(ips)
        with open('ip.txt', 'w', encoding='utf-8') as f:
            f.write(page_text)
VII. CrawlSpider
Overview: a subclass of Spider
Purpose: used to implement full-site data crawling
Usage:
1. Create a project
2. cd ProName
3. scrapy genspider -t crawl spiderName start_url
Link extractor (LinkExtractor): extracts links according to the specified rules
Rule parser (Rule): sends requests for the links extracted by the LinkExtractor, then parses the responses according to the specified rule
Spider file code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    # link extractor
    # purpose: extracts links (URLs) matching the specified rule (allow)
    link = LinkExtractor(allow=r'id/1/page/\d+\.html')
    rules = (
        # instantiate a Rule object
        # Rule: rule parser
        # purpose: sends requests for the links extracted by the link extractor and parses the data according to the specified rule
        Rule(link, callback='parse_item', follow=True),
    )

    # used for data parsing
    def parse_item(self, response):
        # parsing
        print(response)
VIII. Distributed crawling
Concept: build a cluster of machines and have them all run the same program to jointly crawl the data of one resource
Implementation: scrapy + redis (Scrapy plus the scrapy_redis component)
Why native Scrapy cannot share work across machines:
1. the scheduler cannot be shared
2. the pipeline cannot be shared
What the scrapy_redis component provides:
a shared pipeline and a shared scheduler
Environment setup:
pip install scrapy-redis
Coding workflow:
Modify the spider file (see the sketch at the end of this section):
1. Import: from scrapy_redis.spiders import RedisCrawlSpider
2. Change the spider file's parent class (to RedisCrawlSpider)
3. Remove start_urls and allowed_domains
4. Add a redis_key attribute; its value can be any string
5. Write the rest of the spider file as usual
6. Edit the settings configuration file:
- Specify the pipeline:
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline':400
}
- Specify the scheduler:
#add a configuration for a dedup container class
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
#use the scheduler provided by the scrapy_redis component
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
#configure whether the scheduler should persist its state
SCHEDULER_PERSIST = True
REDIS_HOST = '192.168.2.201'
REDIS_PORT = 6379
- Modify the Redis configuration file (redis.windows.conf)
- Start the Redis server and the Redis client
- Push the start URL into the shared scheduler queue
  - the queue lives in the Redis database
  - use redis-cli
To be continued......
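As referenced in the workflow above, here is a minimal spider sketch for scrapy_redis. It assumes the movie CrawlSpider from section VII is being converted, and the redis_key name 'movieurl' is an arbitrary assumption, not from the original notes:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class MovieSpider(RedisCrawlSpider):
    name = 'movie'
    # start_urls and allowed_domains are removed
    # redis_key names the shared Redis list that the whole cluster reads start URLs from
    redis_key = 'movieurl'
    rules = (
        Rule(LinkExtractor(allow=r'id/1/page/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

With this in place, the crawl is seeded from redis-cli on the Redis host, e.g. lpush movieurl https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html, and every machine running the spider pulls requests from that shared queue.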
