Contents
Example website: http://example.python-scraping.com
Resources provided at: https://www.epubit.com/
Chapter 1: Introduction to Web Scraping
1.1 When is web scraping useful?
- Gathering online data in bulk, in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- Usually fine when the scraped data is for personal use and stays within fair use under copyright law
1.3 Python 3
- Tools (environment setup sketch below):
  - Anaconda
  - virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
  - conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
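For reference, a minimal sketch of creating an isolated environment with either the standard-library venv module or conda; the environment name `scraping` and the pinned Python version are illustrative choices, not from the book.

```bash
# option 1: the standard-library venv module (Python 3.4+)
python3 -m venv scraping
source scraping/bin/activate

# option 2: a conda environment
conda create --name scraping python=3.6
conda activate scraping   # older conda releases use "source activate scraping"
```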
1.4 Background research
- Research tools:
  - robots.txt
  - sitemap
  - Google search -> WHOIS
1.4.1 Checking robots.txt
- Learn the crawling restrictions of the current site (see the sketch below)
- Can reveal clues about the site's structure
- See: http://robotstxt.org
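As a quick illustration, the standard library's urllib.robotparser can read robots.txt and answer whether a given user agent may fetch a URL. This is only a sketch against the example site; the agent names 'BadCrawler' and 'wswp' are assumptions borrowed from the book's sample robots.txt, so the actual answers depend on the rules the site publishes.

```python
from urllib import robotparser

# point the parser at the example site's robots.txt and parse it
rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()

url = 'http://example.python-scraping.com'
# the user-agent names below are illustrative; results depend on the published rules
print(rp.can_fetch('BadCrawler', url))  # expected False if 'BadCrawler' is disallowed
print(rp.can_fetch('wswp', url))        # expected True for agents that are not blocked
```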
1.4.2 Checking the sitemap
- Helps a crawler locate the site's most recent content without crawling every page
- Sitemap standard: http://www.sitemaps.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it (an efficiency concern)
- Tool: https://www.google.com/advanced_search
- Appending a URL path to the domain filters the results down to certain sections of the site (example queries below)
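For example, Google's `site:` keyword restricts results to a single domain, and the rough result count serves as the size estimate; adding a path narrows the estimate to one section. The queries below are only illustrative, and `/places/` is just one section that appears on the example site.

```
site:example.python-scraping.com           -> rough count of indexed pages for the whole domain
site:example.python-scraping.com/places/   -> estimate restricted to the /places/ section
```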
1.4.4 Identifying the technologies used by a website
- the detectem module (pip install detectem)
- Setup:
  - install Docker (http://www.docker.com/products/overview)
  - bash: $ docker pull scrapinghub/splash
  - bash: $ pip install detectem
  - a Python virtual environment (https://docs.python.org/3/library/venv.html)
  - or a conda environment (https://conda.io/docs/using/envs.html)
  - see the project's README (https://github.com/spectresearch/detectem)
```bash
$ det http://example.python-scraping.com
[{'name': 'jquery', 'version': '1.11.0'},
 {'name': 'modernizr', 'version': '2.7.1'},
 {'name': 'nginx', 'version': '1.12.2'}]

$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
```
1.4.5 Finding the owner of a website
- Use the WHOIS protocol to look up the registered owner of a domain
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
```python
import whois

print(whois.whois('url'))
```
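A slightly fuller sketch, assuming the python-whois package is installed; the domain below is just the registered domain behind the example site, and which fields (emails, name servers, ...) come back depends on the registry, so treat the field names as illustrative.

```python
import whois

# 'python-scraping.com' is used purely as an illustration; any registered domain works
record = whois.whois('python-scraping.com')
print(record)                       # the full parsed WHOIS record
print(record.get('emails'))         # contact emails, if the registry exposes them
print(record.get('name_servers'))   # authoritative name servers, if exposed
```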
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl; which one is most suitable depends on the structure of the target site
- Three common approaches to crawling a site:
  - crawling the sitemap
  - iterating over every page via database IDs
  - following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from it
- Crawling: built in a generic way, targeting a set of websites across top-level domains or the entire web. It can be used to collect more specific information, but more commonly it crawls the web broadly, gathering small, generic pieces of information from different sites or pages and then following links to further pages.
1.5.2 Downloading a web page
1.5.2.1 Retrying downloads
- Temporary errors are common when downloading, e.g. server overload (503 Service Unavailable); wait briefly, then try the download again
- Page not found (404 Not Found) and other problems with the request (4xx): retrying the download has no effect
- Problems on the server side (5xx, e.g. 503 Service Unavailable): the download can be retried
1.5.2.2 Setting a user agent
```python
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


# user_agent='wswp' sets a default user agent
def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent on the request
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent)
    return html
```
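A quick usage check of the function above; the custom user-agent string is just an illustrative value.

```python
# fetch the example site's front page with the default 'wswp' user agent
html = download('http://example.python-scraping.com')

# the retry count and user agent can be overridden per call
html = download('http://example.python-scraping.com', num_retries=1, user_agent='my-crawler')
```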
1.5.3 Sitemap crawler
- Use a regular expression to extract the URLs from the sitemap's `<loc>` tags (the sitemap URL itself can be found in robots.txt)
```python
# import the URL download library
import urllib.request
# import the regular expression library
import re
# import the download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here


test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
```
1.5.4 ID iteration crawler
```python
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html


def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of consecutive errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result


test_url2 = 'http://example.python-scraping.com/view/-'
# original note: this example still had an unresolved issue to debug
crawl_site(test_url2)
```