What Is a Web Crawler

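Baidu Baike defines it this way: a web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.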
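To make "automatically fetches pages according to a set of rules" concrete, here is a minimal sketch of the basic crawl loop: fetch a page, extract its links, and queue the ones not seen yet. It is only an illustration. `FAKE_SITE`, `crawl`, and the paths are hypothetical names invented for this example, and the "website" is an in-memory dictionary so the code runs without network access.

```python
import re
from collections import deque

# Hypothetical stand-in for a website: path -> HTML body.
FAKE_SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": "no links here",
}

# The "rule" in this sketch: follow every <a href="..."> link.
LINK_RE = re.compile(r'<a href="(.*?)">')


def crawl(start, fetch, max_pages=100):
    """Breadth-first crawl starting at `start`.

    `fetch` is any callable that maps a path to HTML (or None if missing).
    Returns the list of pages in the order they were visited.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        path = queue.popleft()
        html = fetch(path)
        if html is None:  # page does not exist: skip it
            continue
        visited.append(path)
        for link in LINK_RE.findall(html):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", FAKE_SITE.get)
print(order)
```

A real crawler would replace `FAKE_SITE.get` with a function that downloads the page, and would normally also limit its request rate and respect the site's rules.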

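My own take: when you need to collect a large amount of data or process things in batches, a Python crawler can do it quickly and save you the repetitive work. Examples include a Weibo direct-message bot, batch-downloading American TV series, crawling Tmall and JD for discounts, hunting for discounted flights, crawling for suitable housing listings, system administrators' script tasks, and so on.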

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import requests


# Example of a single chapter URL:
# url1 = 'https://www.ybdu.com/xiaoshuo/2/2531/259845.html'

def get_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    # Assumption: the server may not declare its charset correctly,
    # so let requests guess the encoding from the page content.
    response.encoding = response.apparent_encoding
    return response.text


def get_chapter_info(html):
    """Return a list of (relative_url, chapter_title) tuples from the index page."""
    ul = re.findall(r'<ul class="mulu_list">(.*?)</ul>', html, re.S)[0]
    chapter_info = re.findall(r'<a href="(.*?)">(.*?)</a>', ul)
    return chapter_info


def get_chapter_content(url):
    """Download one chapter page and return its text with markup removed."""
    print(url)
    response = get_html(url)
    content = re.findall(r'<div id="htmlContent" class="contentbox">(.*?)<div class="ad00">', response, re.S)[0]
    content = content.replace('&nbsp;', '')
    content = content.replace('<br />', '')
    return content


def main():
    url = 'https://www.ybdu.com/xiaoshuo/2/2531/'
    html = get_html(url)
    title = re.findall(r'<h1>(.*?)全文阅读</h1>', html)[0].strip()
    chapter_info = get_chapter_info(html)
    # Write the whole novel into a single text file named after its title.
    with open('%s.txt' % title, 'w', encoding='utf-8') as f:
        f.write('%s\n' % title)
        for chapter in chapter_info:
            # chapter is (relative_url, chapter_title)
            f.write('%s\n' % chapter[1])
            content = get_chapter_content(url + chapter[0])
            f.write('%s\n' % content)


if __name__ == '__main__':
    main()
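The extraction step of the script can be tried without touching the network. The sketch below runs the same two regular expressions on a made-up HTML fragment and builds absolute chapter URLs with `urllib.parse.urljoin`, which is a somewhat more robust alternative to plain string concatenation. `SAMPLE_HTML` and `parse_chapters` are hypothetical names for this illustration, and the fragment only assumes the page structure that the regular expressions in the script imply.

```python
import re
from urllib.parse import urljoin

# Made-up fragment shaped like the index page the script expects.
SAMPLE_HTML = """
<h1>Demo Novel全文阅读</h1>
<ul class="mulu_list">
<li><a href="1.html">Chapter 1</a></li>
<li><a href="2.html">Chapter 2</a></li>
</ul>
"""


def parse_chapters(html, base_url):
    """Return a list of (absolute_url, chapter_title) tuples."""
    # re.S lets '.' match newlines, because the <ul> spans several lines.
    ul = re.findall(r'<ul class="mulu_list">(.*?)</ul>', html, re.S)[0]
    pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', ul)
    return [(urljoin(base_url, href), title) for href, title in pairs]


base = 'https://www.ybdu.com/xiaoshuo/2/2531/'
title = re.findall(r'<h1>(.*?)全文阅读</h1>', SAMPLE_HTML)[0].strip()
chapters = parse_chapters(SAMPLE_HTML, base)

print(title)
for chapter_url, chapter_title in chapters:
    print(chapter_title, chapter_url)
```

Note that indexing with `[0]` raises `IndexError` when the pattern is not found, so if the site changes its markup the failure is immediate rather than silent.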

[Figure: screenshot of the script running]

Source: "Python爬取网络小说，了解下" (Crawling Web Novels with Python: A Quick Look)
