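I. Overview

1.1 Creating a Scrapy project with PyCharm

1. Create the Scrapy project by following the blog post below

Creating a Scrapy project in PyCharm

2. The project directory is as follows

[Figure: project directory structure]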

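3. File descriptions

scrapy.cfg: project configuration, mainly the base configuration used by the Scrapy command-line tool. (The settings that actually affect the crawler live in settings.py.)
items.py: defines the templates used to structure the scraped data, similar to a Django Model.
pipelines.py: item-processing behavior, typically persisting the structured data.
settings.py: the configuration file, e.g. recursion depth, concurrency, download delay.
spiders: the spider directory, where you create files and write the crawling rules.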
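
As a rough illustration of the kind of options settings.py holds, a fragment might look like the one below. The three setting names are standard Scrapy settings; the values are arbitrary assumptions chosen for this sketch, not values from the original project.

```python
# settings.py (fragment): the values are illustrative assumptions only

# Maximum crawl depth ("recursion depth"); 0 would mean no limit.
DEPTH_LIMIT = 3

# Maximum number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 8

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 1.5
```

Lower concurrency and a non-zero delay make a crawler slower but gentler on the target site.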

1.2 Writing the project launcher

1. Generating the spider class for the site

Use the scrapy genspider command to generate the spider class that crawls Douban
```
scrapy genspider douban https://book.douban.com/
```
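Members of the generated spider class:
name: identifies the spider. It must be unique; two different spiders cannot share the same name.
start_urls: the list of URLs the spider starts crawling from, so the first pages fetched are among them. Subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the spider. The Response object produced when each start URL finishes downloading is passed to it as its only argument. It is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.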

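2. Writing the Scrapy launcher script

Create main.py in the same directory as scrapy.cfg, with the following content:

```
import sys
import os
from scrapy.cmdline import execute

# Make the project root importable, then run the equivalent of "scrapy crawl douban"
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'douban'])
```

Note: the argument to execute must be a list, and its third element must equal the name attribute of the Scrapy spider class you want to run, which here is douban.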

3. Configuring robots.txt handling

In settings.py, set ROBOTSTXT_OBEY to False
```
ROBOTSTXT_OBEY = False
```
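In a project generated by scrapy startproject, ROBOTSTXT_OBEY is set to True, which means the crawler obeys the rules in robots.txt. robots.txt is a file that follows the Robots protocol and is stored on the website's server; it tells search-engine crawlers which directories of the site should not be crawled and indexed. With the setting enabled, Scrapy first requests the site's robots.txt after starting and then uses it to decide which parts of the site it may crawl.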
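
To make the effect of robots.txt rules concrete, here is a minimal sketch that uses Python's standard urllib.robotparser rather than Scrapy's own middleware. The robots.txt content, the crawler name and the example.com URLs are all made up for illustration; they are not Douban's real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not Douban's real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

# Ask whether a crawler named "my-crawler" may fetch each URL.
allowed = parser.can_fetch("my-crawler", "https://example.com/ebook/1/")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

A crawler that obeys robots.txt skips the second URL; setting ROBOTSTXT_OBEY = False tells Scrapy not to perform this kind of check at all.
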
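There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors.

4. The Scrapy Selectors mechanism

A Selector has four basic methods:
1. xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
2. css(): takes a CSS expression and returns a selector list of all nodes matching it.
3. extract(): serializes the matched nodes to unicode strings and returns them as a list.
4. re(): extracts data with the given regular expression and returns a list of unicode strings.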

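5. Trying selectors in the shell:

Note: before starting the shell, install pywin32 (needed on Windows) and ipython

Run the following command in the project directory to start the shell:

```
scrapy shell https://read.douban.com/ebook/123133635/
```

Main use of the shell: interactively debugging XPath and CSS expressions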

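1.3 XPath syntax

1. Common path expressions

| Expression | Description |
| --- | --- |
| `nodename` | Selects all child nodes of the named node. |
| `/` | Selects from the root node. |
| `//` | Selects nodes anywhere in the document, starting from the current node, that match the selection, regardless of their position. |
| `.` | Selects the current node. |
| `..` | Selects the parent of the current node. |
| `@` | Selects attributes. |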

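Examples:

| Path expression | Result |
| --- | --- |
| `bookstore` | Selects all child nodes of the bookstore element. |
| `/bookstore` | Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element. |
| `bookstore/book` | Selects all book elements that are children of bookstore. |
| `//book` | Selects all book elements, wherever they are in the document. |
| `bookstore//book` | Selects all book elements that are descendants of the bookstore element, wherever they are located beneath it. |
| `//@lang` | Selects all attributes named lang. |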
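
The path expressions above can be tried out without Scrapy. The sketch below uses Python's standard xml.etree.ElementTree, which implements only a subset of XPath (relative paths starting from an element, no attribute-node selection), so some expressions are adapted as noted in the comments. The sample bookstore document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption for this sketch).
XML = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(XML)  # root is the <bookstore> element

# bookstore/book: children named book (path is relative to root here)
books = root.findall("book")

# //book: ElementTree needs a leading "." for a descendant search
all_books = root.findall(".//book")

# //title: collect the text of every title element
titles = [t.text for t in root.findall(".//title")]

# //@lang: ElementTree cannot return attribute nodes, so select the
# elements that carry the attribute and read it from each one
langs = [e.get("lang") for e in root.findall(".//*[@lang]")]

print(len(books), len(all_books), titles, langs)
```

In Scrapy itself the same expressions can be passed unchanged to response.xpath(), which supports full XPath 1.0.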

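2. Predicates

Definition: a predicate finds a specific node or a node that contains a specific value; predicates are written inside square brackets.
Examples:

| Path expression | Result |
| --- | --- |
| `/bookstore/book[1]` | Selects the first book element that is a child of bookstore. |
| `/bookstore/book[last()]` | Selects the last book element that is a child of bookstore. |
| `/bookstore/book[last()-1]` | Selects the second-to-last book element that is a child of bookstore. |
| `/bookstore/book[position()<3]` | Selects the first two book elements that are children of bookstore. |
| `//title[@lang]` | Selects all title elements that have an attribute named lang. |
| `//title[@lang='eng']` | Selects all title elements whose lang attribute has the value eng. |
| `/bookstore/book[price>35.00]` | Selects all book elements of bookstore whose price element has a value greater than 35.00. |
| `/bookstore/book[price>35.00]/title` | Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00. |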
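
A few of these predicates can be checked with the standard library. This is a minimal sketch using xml.etree.ElementTree, which supports positional and attribute predicates but not numeric comparisons such as price>35.00, so that case is filtered in plain Python instead. The sample document is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption for this sketch).
XML = """
<bookstore>
  <book><title lang="eng">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(XML)

# /bookstore/book[1], book[last()] and book[last()-1], relative to root
first = root.find("book[1]/title").text
last = root.find("book[last()]/title").text
second_last = root.find("book[last()-1]/title").text

# //title[@lang] and //title[@lang='eng']
with_lang = [t.text for t in root.findall(".//title[@lang]")]
eng = [t.text for t in root.findall(".//title[@lang='eng']")]

# /bookstore/book[price>35.00]/title: comparison done in Python because
# ElementTree's XPath subset has no numeric comparison operators
expensive = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35.00
]

print(first, last, second_last, with_lang, eng, expensive)
```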

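3. Selecting unknown nodes

| Wildcard | Description |
| --- | --- |
| `*` | Matches any element node. |
| `@*` | Matches any attribute node. |
| `node()` | Matches a node of any type. |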

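4. Selecting several paths

| Path expression | Result |
| --- | --- |
| `//book/title \| //book/price` | Selects all title and price elements of book elements. |
| `//title \| //price` | Selects all title and price elements in the document. |
| `/bookstore/book/title \| //price` | Selects all title elements of the book elements of bookstore, plus all price elements in the document. |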

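5. Common XPath functions

| Function | Description | Example |
| --- | --- | --- |
| `boolean()` | Returns the boolean value of a number, string, or node set | |
| `ceiling()` | Returns the smallest integer that is not less than the num argument | `ceiling(3.14)` returns 4 |
| `concat()` | Returns the concatenation of the strings | `concat('XPath ','is ','FUN!')` returns 'XPath is FUN!' |
| `contains()` | Returns true if string1 contains string2, otherwise false | |
| `count()` | Returns the number of nodes | |
| `false()` | Returns the boolean value false | |
| `floor()` | Returns the largest integer that is not greater than the num argument | `floor(3.14)` returns 3 |
| `lang(lang)` | Returns true if the language of the current node matches the specified language | `lang("en")` is true for … |
| `last()` | Returns the number of items in the node list being processed | `//book[last()]` selects the last book element |
| `local-name()` | Returns the name of the current node, or of the first node in the specified node set, without the namespace prefix | |
| `name()` | Returns the name of the current node, or of the first node in the specified node set | |
| `namespace-uri()` | Returns the namespace URI of the current node, or of the first node in the specified node set | |
| `normalize-space()` | Removes leading and trailing whitespace from the string and replaces every internal run of whitespace with a single space, then returns the result. Without a string argument, it processes the current node | |
| `not()` | First converts the argument to a boolean with boolean(). Returns true if that boolean is false, otherwise returns false | |
| `number(arg)` | Returns the numeric value of the argument, which can be a boolean, a string, or a node set | |
| `position()` | Returns the index position of the node currently being processed | |
| `round()` | Rounds the num argument to the nearest integer | `round(3.14)` returns 3 |
| `starts-with()` | Returns true if string1 starts with string2, otherwise false | `starts-with('XML','X')` returns true |
| `string()` | Returns the string value of the argument, which can be a number, a boolean, or a node set | `string(314)` returns "314" |
| `string-length()` | Returns the length of the specified string. Without a string argument, returns the length of the string value of the current node | |
| `substring()` | Returns the substring of the given length starting at position start. The index of the first character is 1. If len is omitted, returns the substring from start to the end of the string | `substring('Beatles',1,4)` returns 'Beat' |
| `substring-after()` | Returns the part of string1 that follows the first occurrence of string2 | `substring-after('12/10','/')` returns '10' |
| `substring-before()` | Returns the part of string1 that precedes the first occurrence of string2 | `substring-before('12/10','/')` returns '12' |
| `sum()` | Returns the sum of the numeric values of the nodes in the specified node set | |
| `translate()` | Replaces each character of string1 that occurs in string2 with the character at the same position in string3 | `translate('12:30','30','45')` returns '12:45' |
| `true()` | Returns the boolean value true | |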
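
The string functions are easier to remember next to their Python counterparts. The helpers below are hypothetical stand-ins written for this sketch, not part of any XPath engine, and they cover only the simple cases (for example, substring assumes a positive integer start and the separator passed to the substring helpers is non-empty).

```python
def normalize_space(s):
    """Like normalize-space(): trim and collapse runs of whitespace."""
    return " ".join(s.split())

def substring(s, start, length=None):
    """Like substring(): positions are 1-based, unlike Python slices."""
    begin = start - 1
    return s[begin:] if length is None else s[begin:begin + length]

def substring_before(s, sep):
    """Like substring-before(): empty string when sep is not found."""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s, sep):
    """Like substring-after(): empty string when sep is not found."""
    _, found, tail = s.partition(sep)
    return tail if found else ""

def translate(s, src, dst):
    """Like translate(): maps characters one by one, not whole substrings."""
    table = {}
    for i, ch in enumerate(src):
        # The first occurrence of a character in src wins; characters
        # with no counterpart in dst are deleted (mapped to None).
        table.setdefault(ord(ch), dst[i] if i < len(dst) else None)
    return s.translate(table)

print(normalize_space("  XPath   is  FUN!  "))
print(substring("Beatles", 1, 4))
print(substring_before("12/10", "/"), substring_after("12/10", "/"))
print(translate("12:30", "30", "45"), translate("12:30", "03", "54"))
```

The last line shows why translate() is a character mapping: swapping the order of the characters in both arguments gives the same result.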

1.4 Test run

1. When the project is started, requests to the Douban page fail with a 403 error

[Figure: 403 error output]

Fix: add a USER_AGENT setting to settings.py

```
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
```
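
The USER_AGENT setting ends up as the User-Agent HTTP header of each request, which is what many sites inspect before answering with 403. As a small sketch of that idea outside Scrapy, the standard urllib.request.Request object can be built with the same header and inspected without sending anything over the network.

```python
from urllib.request import Request

# Same browser-like string as in the settings.py line above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)

# Build the request object only; nothing is sent here.
request = Request(
    "https://read.douban.com/ebook/123133635/",
    headers={"User-Agent": USER_AGENT},
)

# urllib normalizes header names to the form "User-agent".
sent_agent = request.get_header("User-agent")
print(sent_agent)
```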

2. Debug test

The body holds the content we want to parse; a status of 200 means the request to the Douban page succeeded

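1.5 CSS selectors

1. Syntax

The "CSS" column indicates the CSS version in which the selector was defined (CSS1, CSS2, or CSS3).

| Selector | Example | Description | CSS |
| --- | --- | --- | --- |
| `.class` | `.intro` | Selects all elements with class="intro" | 1 |
| `#id` | `#firstname` | Selects all elements with id="firstname" | 1 |
| `*` | `*` | Selects all elements | 2 |
| `element` | `p` | Selects all `<p>` elements | 1 |
| `element,element` | `div,p` | Selects all `<div>` elements and all `<p>` elements | 1 |
| `element element` | `div p` | Selects all `<p>` elements inside `<div>` elements | 1 |
| `element>element` | `div>p` | Selects all `<p>` elements whose parent is a `<div>` element | 2 |
| `element+element` | `div+p` | Selects all `<p>` elements placed immediately after a `<div>` element | 2 |
| `[attribute]` | `[target]` | Selects all elements with a target attribute | 2 |
| `[attribute=value]` | `[target=_blank]` | Selects all elements with target="_blank" | 2 |
| `[attribute~=value]` | `[title~=flower]` | Selects all elements whose title attribute contains the word "flower" | 2 |
| `[attribute\|=language]` | `[lang\|=en]` | Selects all elements whose lang attribute value is "en" or starts with "en-" | 2 |
| `:link` | `a:link` | Selects all unvisited links | 1 |
| `:visited` | `a:visited` | Selects all visited links | 1 |
| `:active` | `a:active` | Selects the active link | 1 |
| `:hover` | `a:hover` | Selects the link the mouse is over | 1 |
| `:focus` | `input:focus` | Selects the input element that has focus | 2 |
| `:first-letter` | `p:first-letter` | Selects the first letter of every `<p>` element | 1 |
| `:first-line` | `p:first-line` | Selects the first line of every `<p>` element | 1 |
| `:first-child` | `p:first-child` | Selects every `<p>` element that is the first child of its parent | 2 |
| `:before` | `p:before` | Inserts content before every `<p>` element | 2 |
| `:after` | `p:after` | Inserts content after every `<p>` element | 2 |
| `:lang(language)` | `p:lang(it)` | Selects every `<p>` element whose lang attribute value starts with "it" | 2 |
| `element1~element2` | `p~ul` | Selects every `<ul>` element that follows a `<p>` element | 3 |
| `[attribute^=value]` | `a[src^="https"]` | Selects every element whose src attribute value starts with "https" | 3 |
| `[attribute$=value]` | `a[src$=".pdf"]` | Selects every element whose src attribute value ends with ".pdf" | 3 |
| `[attribute*=value]` | `a[src*="runoob"]` | Selects every element whose src attribute value contains the substring "runoob" | 3 |
| `:first-of-type` | `p:first-of-type` | Selects every `<p>` element that is the first `<p>` element of its parent | 3 |
| `:last-of-type` | `p:last-of-type` | Selects every `<p>` element that is the last `<p>` element of its parent | 3 |
| `:only-of-type` | `p:only-of-type` | Selects every `<p>` element that is the only `<p>` element of its parent | 3 |
| `:only-child` | `p:only-child` | Selects every `<p>` element that is the only child of its parent | 3 |
| `:nth-child(n)` | `p:nth-child(2)` | Selects every `<p>` element that is the second child of its parent | 3 |
| `:nth-last-child(n)` | `p:nth-last-child(2)` | Selects every `<p>` element that is the second child of its parent, counting from the last child | 3 |
| `:nth-of-type(n)` | `p:nth-of-type(2)` | Selects every `<p>` element that is the second `<p>` element of its parent | 3 |
| `:nth-last-of-type(n)` | `p:nth-last-of-type(2)` | Selects every `<p>` element that is the second `<p>` element of its parent, counting from the last child | 3 |
| `:last-child` | `p:last-child` | Selects every `<p>` element that is the last child of its parent | 3 |
| `:root` | `:root` | Selects the root element of the document | 3 |
| `:empty` | `p:empty` | Selects every `<p>` element that has no children (text nodes included) | 3 |
| `:target` | `#news:target` | Selects the currently active #news element (the clicked URL contains that anchor name) | 3 |
| `:enabled` | `input:enabled` | Selects every enabled input element | 3 |
| `:disabled` | `input:disabled` | Selects every disabled input element | 3 |
| `:checked` | `input:checked` | Selects every checked input element | 3 |
| `:not(selector)` | `:not(p)` | Selects every element that is not a `<p>` element | 3 |
| `::selection` | `::selection` | Matches the part of an element that the user has selected or highlighted | 3 |
| `:out-of-range` | `:out-of-range` | Matches input elements whose value is outside the specified range | 3 |
| `:in-range` | `:in-range` | Matches input elements whose value is within the specified range | 3 |
| `:read-write` | `:read-write` | Matches elements that are both readable and writable | 3 |
| `:read-only` | `:read-only` | Matches elements with the "readonly" attribute set | 3 |
| `:optional` | `:optional` | Matches optional input elements | 3 |
| `:required` | `:required` | Matches elements with the "required" attribute set | 3 |
| `:valid` | `:valid` | Matches elements whose input value is valid | 3 |
| `:invalid` | `:invalid` | Matches elements whose input value is invalid | 3 |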

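1.6 Introduction to the Scrapy framework

[Figure: Scrapy architecture diagram]

The Scrapy Engine (the core of Scrapy) controls the flow of data between all components.
Spiders issue Requests, which the Scrapy Engine hands to the Scheduler.
The Downloader receives Requests from the Scheduler and downloads the corresponding data from the network.
The Downloader's Responses are then passed back to the Spiders for parsing. The Spiders extract Items as required and hand them to the Item Pipeline for processing. Spiders and Item Pipelines are the parts users write to suit their own requirements.
In addition there are two kinds of middleware, Downloader Middlewares and Spider Middlewares. They make it convenient to extend Scrapy by plugging in custom code, for example for deduplication.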
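
The data flow described above can be imitated in a few lines of plain Python. This is only a toy simulation under stated assumptions: every function name here is hypothetical, the "downloader" fabricates its responses instead of using the network, and real Scrapy runs these steps asynchronously.

```python
from collections import deque

def fake_download(url):
    """Stand-in for the Downloader: builds a fake response, no network."""
    return {"url": url, "body": f"<html>content of {url}</html>"}

def spider_parse(response):
    """Stand-in for a Spider callback: yields an item and maybe a new request."""
    yield {"item": {"url": response["url"], "length": len(response["body"])}}
    if response["url"].endswith("/list"):
        yield {"request": response["url"].replace("/list", "/detail/1")}

def run_engine(start_urls):
    """Stand-in for the Engine: moves data between the other components."""
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # remembers URLs that were already queued
    pipeline_output = []            # Item Pipeline: here it only collects items
    while scheduler:
        url = scheduler.popleft()              # Scheduler -> Engine
        response = fake_download(url)          # Engine -> Downloader -> Engine
        for result in spider_parse(response):  # Engine -> Spider
            if "request" in result:
                if result["request"] not in seen:
                    seen.add(result["request"])
                    scheduler.append(result["request"])  # new Request -> Scheduler
            else:
                pipeline_output.append(result["item"])   # Item -> Item Pipeline
    return pipeline_output

items = run_engine(["https://example.com/list"])
print([item["url"] for item in items])
```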

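II. Crawling in practice

1.1 Crawling a Douban novel's details

1. Requirements:

Collect the following data:
1. Title
2. Subtitle
3. Author, translator, category, publisher, publication date, provider, word count, ISBN, Douban rating
4. Price of the complete book
5. Introduction

1.2 Extracting the novel's details with XPath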

```
    # Method of DoubanSpider; relies on "import re" and NovelItem from the spider module
    def get_article_byxpath(self, response):
        """Extract the novel's details with XPath."""
        # Prefixes shared by most of the expressions below
        profile = "//*[@class = 'main']/article/div[1]/div[@class = 'article-profile-bd']"
        meta = profile + "/div[@class='article-meta']"

        # Title
        title = response.xpath(profile + "/h1/text()").get()
        # Subtitle
        subtitle = response.xpath(profile + "/p/text()").get()
        # Author
        author = response.xpath(
            meta + "/p[@class='author']//a[@class='author-item']/text()").get()
        # Translator
        translator = response.xpath(
            meta + "/p[@class='translator']//a[@class='author-item']/text()").get()
        # Category
        category = response.xpath(
            meta + "/p[@class='category']//span[@itemprop='genre']/text()").get()
        # Publisher
        publish_house = response.xpath(
            meta + "/p[4]/span[@class='labeled-text']/span[1]/text()").get()
        # Publication date
        publish_time = response.xpath(
            meta + "/p[4]/span[@class='labeled-text']/span[2]/text()").get()
        # Provider
        provider = response.xpath(
            meta + "/p[5]//a[@itemprop='provider']/text()").get()
        # Word count: keep only digits and thousands separators
        # (re.findall raises if the field is missing from the page)
        word_numbers = response.xpath(
            meta + "/p[6]/span[@class='labeled-text']/text()").get()
        word_numbers = re.findall(r"[\d,]+", word_numbers)[0]
        # ISBN
        isbn = response.xpath(
            meta + "/p[7]//a[@itemprop='isbn']/text()").get()
        # Rating
        grade = response.xpath(profile + "/div[2]/span[2]/text()").get()
        # Price: keep only the numeric part
        price = response.xpath(
            "//*[@class = 'main']/article/div[2]/div[1]/span[1]//span[@class='current-price-count']/text()").get()
        price = re.findall(r"[\d,.]+", price)[0]
        # Introduction: concatenate the text of all paragraphs
        introduction_list = response.xpath(
            "//*[@class = 'main']/article/section/section[1]/div[@itemprop='description']//p/text()").getall()
        introduction = "".join(introduction_list)
        # Cover image URL
        img_url = response.xpath("//div[@class='main']/article/div[1]//img[1]/@src").get()

        novel = NovelItem()
        novel['title'] = title
        novel['subtitle'] = subtitle
        novel['author'] = author
        novel['translator'] = translator
        novel['category'] = category
        novel['publish_house'] = publish_house
        novel['publish_time'] = publish_time
        novel['provider'] = provider
        novel['word_numbers'] = word_numbers
        novel['isbn'] = isbn
        novel['grade'] = grade
        novel['price'] = price
        novel['introduction'] = introduction
        novel['img_url'] = img_url
        print(novel)
```
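
The two re.findall(...)[0] calls in the method above assume the field is always present; if the selector returns None or the text contains no digits, they raise an exception. One possible way to make that step tolerant is a small helper like the hypothetical first_match below. The sample strings are made up for illustration and are not real page content.

```python
import re

def first_match(pattern, text):
    """Return the first regex match in text, or None if text is None or nothing matches."""
    if text is None:
        return None
    found = re.findall(pattern, text)
    return found[0] if found else None

# Hypothetical raw strings, as a selector's .get() might return them.
word_numbers = first_match(r"[\d,]+", "约 65,000 字")
price = first_match(r"[\d,.]+", "¥19.99")
missing = first_match(r"[\d,]+", None)

print(word_numbers, price, missing)
```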

1.3 Extracting the novel's details with CSS

```
    # Method of DoubanSpider; relies on "import re" and NovelItem from the spider module
    def get_article_bycss(self, response):
        """Extract the novel's details with CSS selectors."""
        # Prefixes shared by most of the selectors below
        profile = ".main article div.article-profile-section .article-profile-bd"
        meta = profile + " .article-meta"

        # Title
        title = response.css(profile + " h1::text").get()
        # Subtitle
        subtitle = response.css(profile + " p::text").get()
        # Author
        author = response.css(meta + " .author a.author-item::text").get()
        # Translator
        translator = response.css(meta + " .translator a.author-item::text").get()
        # Category
        category = response.css(meta + " .category span.labeled-text span::text").get()
        # Publisher
        publish_house = response.css(
            meta + " p:nth-child(6) .labeled-text span:nth-child(1)::text").get()
        # Publication date
        publish_time = response.css(
            meta + " p:nth-child(6) .labeled-text span:nth-child(2)::text").get()
        # Provider
        provider = response.css(meta + " p:nth-child(7) .labeled-text a::text").get()
        # Word count: keep only digits and thousands separators
        # (re.findall raises if the field is missing from the page)
        word_numbers = response.css(meta + " p:nth-child(8) .labeled-text::text").get()
        word_numbers = re.findall(r"[\d,]+", word_numbers)[0]
        # ISBN
        isbn = response.css(meta + " p:nth-child(9) .labeled-text a::text").get()
        # Rating
        grade = response.css(profile + " .rating .score::text").get()
        # Price: keep only the numeric part
        price = response.css(
            ".main article .profile-purchase-container .profile-purchase-actions span.current-price-count::text").get()
        price = re.findall(r"[\d,.]+", price)[0]
        # Introduction: concatenate the text of all paragraphs
        introduction_list = response.css(
            ".main article .article-profile-section section:nth-child(1) div[itemprop='description'] .info p::text").getall()
        introduction = "".join(introduction_list)
        # Cover image URL
        img_url = response.css(".main article div:nth-child(1) div.cover img::attr(src)").get()

        novel = NovelItem()
        novel['title'] = title
        novel['subtitle'] = subtitle
        novel['author'] = author
        novel['translator'] = translator
        novel['category'] = category
        novel['publish_house'] = publish_house
        novel['publish_time'] = publish_time
        novel['provider'] = provider
        novel['word_numbers'] = word_numbers
        novel['isbn'] = isbn
        novel['grade'] = grade
        novel['price'] = price
        novel['introduction'] = introduction
        novel['img_url'] = img_url
        print(novel)
```

1.4 Storing the data in a Scrapy Item

1. Item definition

An Item is a container for the scraped data. It is used much like a Python dict, and it adds a protection mechanism that turns a misspelled, undefined field name into an error.
As in an ORM, you define an Item by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field; scrapy.Field is the only field type an Item has.

2. Edit items.py

```
import scrapy


class NovelItem(scrapy.Item):
    """Fields collected for one novel page."""
    title = scrapy.Field()
    subtitle = scrapy.Field()
    author = scrapy.Field()
    translator = scrapy.Field()
    category = scrapy.Field()
    publish_house = scrapy.Field()
    publish_time = scrapy.Field()
    provider = scrapy.Field()
    word_numbers = scrapy.Field()
    isbn = scrapy.Field()
    grade = scrapy.Field()
    price = scrapy.Field()
    introduction = scrapy.Field()
    img_url = scrapy.Field()
```

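1.5 Updating the parse method

```
# -*- coding: utf-8 -*-
import re

import scrapy

from BlogScrapy.items import NovelItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    # Domain names only (no scheme or path); this also covers read.douban.com
    allowed_domains = ['douban.com']
    # The Douban novel page to crawl
    start_urls = ['https://read.douban.com/ebook/123133635/']
    base_url = 'https://read.douban.com/'

    def parse(self, response):
        # Extract the same fields twice, once per selector style
        self.get_article_bycss(response)
        self.get_article_byxpath(response)
```

Test result: both the XPath and the CSS approach retrieved the basic information of the book "82年生的金智英" (Kim Ji-young, Born 1982)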

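1.6 Extension

1. Requirement

Crawl the basic information of every article on the novel home page

2. Debugging with scrapy shell

Problem: the HTML of the novel list cannot be retrieved at all, so there is no way to obtain the URL of each article.
Cause: Douban's novel and movie pages are HTML generated dynamically by JavaScript, so scrapy-splash is needed to crawl the JS-rendered pages. That part is implemented in the next chapter.
Note: after too many requests, Douban's anti-crawling mechanism makes scrapy shell fail with a 403 error. For a fix, see the blog post "Scrapy shell调试返回403错误解决方案" (a solution for 403 errors when debugging with scrapy shell).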

Source: CSDN

Author: 阿川xiang

Link: https://blog.csdn.net/weixin_43054590/article/details/103028245


Scrapy crawler: scraping Douban novel introductions (Part 7)
