Scrapy is a mature crawling framework, and once you start using it, it turns out to be far less difficult than you might expect. Even for small projects, Scrapy can be more convenient, simpler, and more efficient than requests, urllib, or urllib2. Without further ado, here is a detailed walkthrough of how to use Scrapy to crawl the images from Mzitu and store them on your hard drive. Installing Python and Scrapy, and how Scrapy works internally, are not covered here; consult Google or Baidu to learn more.
I. Development Tools
PyCharm 2017
Python 2.7
Scrapy 1.5.0
requests
II. Crawling Process
1. Create the mzitu project
进入"E:\Code\PythonSpider>"目录执行scrapy startproject mzitu命令创建一个爬虫项目:
scrapy startproject mzitu
After the command finishes, the generated directory layout looks like this:
├── mzitu
│   ├── mzitu
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── Mymzitu.py
│   └── scrapy.cfg
2. Enter the mzitu project and edit items.py
Define title to store the name of the gallery (used as the directory name)
Define img to store the image URL
Define name to store the image filename
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MzituItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    img = scrapy.Field()
    name = scrapy.Field()
3. Edit spiders/Mymzitu.py
# -*- coding: utf-8 -*-
import scrapy
from mzitu.items import MzituItem
from lxml import etree
import requests
import sys
reload(sys)                      # Python 2 only: force utf-8 as the default encoding
sys.setdefaultencoding('utf8')


class MymzituSpider(scrapy.Spider):
    def get_urls():
        # Fetch the "latest" index page and collect each gallery's URL.
        url = 'http://www.mzitu.com'
        headers = {}
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        urls = html.xpath('//*[@id="pins"]/li/a/@href')
        return urls

    name = 'Mymzitu'
    allowed_domains = ['www.mzitu.com']
    start_urls = get_urls()      # called once, at class-definition time

    def parse(self, response):
        item = MzituItem()
        #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
        item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
        item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
        item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
        yield item

        # The last link in the page navigation points to the next page.
        next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
We are crawling the images under the "latest" section of the Mzitu site, whose main URL is http://www.mzitu.com. Inspecting the page source shows that each gallery's URL sits inside an <li> tag; the get_urls function in the code above collects them and returns them as a list. One point worth stressing: to write a crawler in Python you must master at least one extraction tool such as re, XPath, or Beautiful Soup, otherwise you cannot even get started. Here we use XPath to extract the URLs; both lxml and Scrapy support it.
def get_urls():
    url = 'http://www.mzitu.com'
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    urls = html.xpath('//*[@id="pins"]/li/a/@href')
    return urls
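If you want to check an XPath expression before wiring it into the spider, scrapy shell is convenient. A quick sketch using the same selector as above (the exact output depends on the live page, and the site may reject Scrapy's default User-Agent):

scrapy shell "http://www.mzitu.com"
>>> response.xpath('//*[@id="pins"]/li/a/@href').extract()[:3]   # first three gallery URLs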
name defines the spider's name, allowed_domains is the list of domains the spider is allowed to crawl, and start_urls is the list of URLs the crawl starts from.
name = 'Mymzitu'
allowed_domains = ['www.mzitu.com']
start_urls = get_urls()
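Calling get_urls() in the class body works because the call happens at class-definition time, but it fires an HTTP request the moment the module is imported. A more conventional alternative is to override start_requests and let Scrapy fetch the index page itself; a sketch of that variant (parse_index is a hypothetical callback name, not part of the original code):

    def start_requests(self):
        # Let Scrapy's downloader fetch the index page instead of requests.
        yield scrapy.Request('http://www.mzitu.com', callback=self.parse_index)

    def parse_index(self, response):
        # Same XPath as get_urls(), applied to a Scrapy response.
        for url in response.xpath('//*[@id="pins"]/li/a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)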
Parse the image detail page to get the gallery title, the image URL and the image filename, then grab the next page and keep crawling in a loop:
def parse(self, response):
    item = MzituItem()
    #item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract()
    item['title'] = response.xpath('//h2[@class="main-title"]/text()')[0].extract().split('(')[0]
    item['img'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract()
    item['name'] = response.xpath('//div[@class="main-image"]/p/a/img/@src')[0].extract().split('/')[-1]
    yield item

    next_url = response.xpath('//div[@class="pagenavi"]/a/@href')[-1].extract()
    if next_url:
        yield scrapy.Request(next_url, callback=self.parse)
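Note that indexing with [0] or [-1] raises an IndexError whenever a selector matches nothing, which aborts the callback. If you would rather skip malformed pages, extract_first (available in Scrapy 1.5) with an emptiness check is safer; a minimal sketch of that variant, not the original code:

    def parse(self, response):
        title = response.xpath('//h2[@class="main-title"]/text()').extract_first()
        src = response.xpath('//div[@class="main-image"]/p/a/img/@src').extract_first()
        if not title or not src:
            return  # unexpected page layout, skip it
        item = MzituItem()
        item['title'] = title.split('(')[0]
        item['img'] = src
        item['name'] = src.split('/')[-1]
        yield item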
4. Edit pipelines.py to download the images
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests
import os

class MzituPipeline(object):
    def process_item(self, item, spider):
        # Include a Referer header when requesting the image (the image host expects one).
        headers = {
            'Referer': 'http://www.mzitu.com/'
        }
        # One directory per gallery title, one file per image name.
        local_dir = 'E:\\data\\mzitu\\' + item['title']
        local_file = local_dir + '\\' + item['name']
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        with open(local_file, 'wb') as f:
            f.write(requests.get(item['img'], headers=headers).content)
        return item
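Downloading with requests inside process_item blocks the item pipeline on every image. Scrapy also ships a built-in ImagesPipeline (it requires Pillow) that routes downloads through Scrapy's own scheduler and deduplicates by URL; a rough sketch of that alternative, assuming you keep the img field from this project:

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class MzituImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Let Scrapy's downloader fetch the image, keeping the Referer header.
            yield scrapy.Request(item['img'],
                                 headers={'Referer': 'http://www.mzitu.com/'})

To enable it you would point ITEM_PIPELINES at this class and set IMAGES_STORE (e.g. IMAGES_STORE = 'E:\\data\\mzitu') in settings.py; note that ImagesPipeline names files by a hash of the URL rather than by the original filename.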
5. Add a RotateUserAgentMiddleware class to middlewares.py
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list composes chrome, IE, firefox, Mozilla, opera, netscape
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
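To confirm the rotation actually takes effect, you can log the User-Agent each request went out with; a quick sketch to drop at the top of the spider's parse method (purely for debugging, not part of the original code):

    def parse(self, response):
        # Log the UA header this request was sent with.
        self.logger.debug('User-Agent used: %s',
                          response.request.headers.get('User-Agent'))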
6. settings.py configuration
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'mzitu.middlewares.MzituDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'mzitu.middlewares.RotateUserAgentMiddleware': 400,
}
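CONCURRENT_REQUESTS = 100 is aggressive and may get the crawler throttled or banned. If that happens, the standard Scrapy politeness settings are worth trying; a hedged sketch, not part of the original configuration:

    DOWNLOAD_DELAY = 0.5         # pause between requests to the same site
    AUTOTHROTTLE_ENABLED = True  # adapt the delay to observed latency
    RETRY_TIMES = 3              # retry failed downloads a few times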
7. Run the spider
Go to the E:\Code\PythonSpider\mzitu directory and run the scrapy crawl Mymzitu command to start the spider:
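scrapy crawl Mymzitu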
The run results and the complete code are available at: https://github.com/Eivll0m/PythonSpider/tree/master/mzitu