Scrapy实战之爬取网页并保存为json格式文件

PS：小编是为了参加大数据技能大赛而学习网络爬虫，对爬虫感兴趣的可以关注我哦，每周更新一篇~（这周迎来第一个粉丝，为了庆祝多发布一篇✌）

👉直奔主题👈

首先，我们来分析目标网页http://www.bookschina.com/kinder/54290000/
在这里插入图片描述
查看网页源代码，搜索关键字的快捷键是（Ctrl+F）

现在已经找到我们所需的信息块，下面开始go~ go~ go~

开始操作实战

第一步：创建bookstore项目，还是熟悉的三句命令：
（PS：记住是在cmd.exe下执行）

scrapy startproject bookstore

cd bookstore

scrapy genspider store "bookschina.com"

第二步：编写代码
一、编写spider.py模块

# -*- coding: utf-8 -*-
import scrapy
import time
from scrapy import Request,Selector
from bookstore.items import BookstoreItem


class StoreSpider(scrapy.Spider):
    name = 'store'
    # allowed_domains = ['bookschina.com']
    # start_urls = ['http://bookschina.com/']
    next_url = 'http://www.bookschina.com'
    url = 'http://www.bookschina.com/kinder/54290000/' #初始URL
    # 伪装
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64) AppleWebKit/537.36 (KHTML , like Gecko) Chrome/67.0.3396.62 Safari/537.36'}

    def start_requests(self):
        yield Request(url=self.url,headers=self.headers,callback=self.parse)

    def parse(self, response):
        item = BookstoreItem()
        selector = Selector(response)
        books = selector.xpath('//div[@class="infor"]')
        for book in books:
            name = book.xpath('h2/a/text()').extract()
            if name: # 这里的处理是因为有些推荐书籍的信息不全，我们将其过滤掉
                name = name
            else:
                continue
            author = book.xpath('div[@class="otherInfor"]/a[@class="author"]/text()').extract()
            press = book.xpath('div[@class="otherInfor"]/a[@class="publisher"]/text()').extract()
            publication_data = book.xpath('div[@class="otherInfor"]/span/text()').extract()
            price = book.xpath('div/span[@class="sellPrice"]/text()').extract()

            item['name'] = ''.join(name)
            item['author'] = ''.join(author)
            item['press'] = ''.join(press)
            item['publication_data'] = publication_data
            item['publication_data'] = ''.join(publication_data).replace('\xa0','').replace('/','')
            item['price'] = ''.join(price)
            yield item
        # 获取下一页的URL
        nextpage = selector.xpath('//div[@class="topPage"]/a[@class="nextPage"]/@href').extract()
        print(nextpage) # 这里实时查看我们接下来爬取的URL链接
        time.sleep(3) # 休眠3秒，减缓网站服务器的压力
        if nextpage: # 如果还有下一页，继续爬取
            nextpage = nextpage[0]
            yield Request(self.next_url + str(nextpage),headers=self.headers,callback=self.parse)

二、编写 items.py

import scrapy


class BookstoreItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field() # 书名
    author = scrapy.Field() # 作者
    press = scrapy.Field() # 出版社
    publication_data = scrapy.Field() # 出版时间
    price = scrapy.Field() # 价格

三、编写 pipeline.py

import json

class BookstorePipeline(object):
    def process_item(self, item, spider):
        with open('E:/study/python/pycodes/bookstore/book.json','a+',encoding='utf-8') as f:
            lines = json.dumps(dict(item),ensure_ascii=False) + '\n'
            f.write(lines)
        return item

四、设置setting.py

设置不遵守 robots.txt 协议
打开 item_pipelines

最后运行程序：scrapy crawl store
(PS: 要切换至项目目录下执行）

我们可以看到数据在疯狂下载

接下来看看我们的成果

哈哈~
这都是一些简单入门级别的，遇到困难不要慌慢慢来

来源：CSDN

作者：爬虫小彪

链接：https://blog.csdn.net/qq_42803848/article/details/103857167

标签

scrapy