Contents
Example website: http://example.python-scraping.com
Resources provided at: https://www.epubit.com/
Chapter 1: Introduction to Web Scraping
1.1 When is web scraping useful?
- Gathering online data in bulk, in a structured format (possible by hand in theory, but automation saves time and effort)
1.2 Is web scraping legal?
- Usually fine when the scraped data is for personal use and stays within fair use under copyright law
1.3 Python 3
- Tools (environment setup sketch below):
  - Anaconda
  - virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest)
  - conda (https://conda.io/docs/intro.html)
- Python version: Python 3.4+
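For reference, a minimal sketch of creating an isolated environment with either the standard-library venv module or conda; the environment name `scraping` and the pinned Python version are illustrative choices, not from the book.

```bash
# option 1: the standard-library venv module (Python 3.4+)
python3 -m venv scraping
source scraping/bin/activate

# option 2: a conda environment
conda create --name scraping python=3.6
conda activate scraping   # older conda releases use "source activate scraping"
```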
1.4 Background research
- Research tools:
  - robots.txt
  - sitemap
  - Google search -> WHOIS
1.4.1 Checking robots.txt
- Learn the crawling restrictions of the current site (see the sketch below)
- Can reveal clues about the site's structure
- See: http://robotstxt.org
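As a quick illustration, the standard library's urllib.robotparser can read robots.txt and answer whether a given user agent may fetch a URL. This is only a sketch against the example site; the agent names 'BadCrawler' and 'wswp' are assumptions borrowed from the book's sample robots.txt, so the actual answers depend on the rules the site publishes.

```python
from urllib import robotparser

# point the parser at the example site's robots.txt and parse it
rp = robotparser.RobotFileParser()
rp.set_url('http://example.python-scraping.com/robots.txt')
rp.read()

url = 'http://example.python-scraping.com'
# the user-agent names below are illustrative; results depend on the published rules
print(rp.can_fetch('BadCrawler', url))  # expected False if 'BadCrawler' is disallowed
print(rp.can_fetch('wswp', url))        # expected True for agents that are not blocked
```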
1.4.2 Checking the sitemap
- Helps a crawler locate the site's most recent content without crawling every page
- Sitemap standard: http://www.sitemaps.org/protocol.html
1.4.3 Estimating the size of a website
- The size of the target site affects how we crawl it (an efficiency concern)
- Tool: https://www.google.com/advanced_search
- Appending a URL path to the domain filters the results down to certain sections of the site (example queries below)
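For example, Google's `site:` keyword restricts results to a single domain, and the rough result count serves as the size estimate; adding a path narrows the estimate to one section. The queries below are only illustrative, and `/places/` is just one section that appears on the example site.

```
site:example.python-scraping.com           -> rough count of indexed pages for the whole domain
site:example.python-scraping.com/places/   -> estimate restricted to the /places/ section
```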
1.4.4 Identifying the technologies used by a website
- the detectem module (pip install detectem)
- Setup:
  - install Docker (http://www.docker.com/products/overview)
  - bash: $ docker pull scrapinghub/splash
  - bash: $ pip install detectem
  - a Python virtual environment (https://docs.python.org/3/library/venv.html)
  - or a conda environment (https://conda.io/docs/using/envs.html)
  - see the project's README (https://github.com/spectresearch/detectem)
```bash
$ det http://example.python-scraping.com
[{'name': 'jquery', 'version': '1.11.0'},
 {'name': 'modernizr', 'version': '2.7.1'},
 {'name': 'nginx', 'version': '1.12.2'}]

$ docker pull wappalyzer/cli
$ docker run wappalyzer/cli http://example.python-scraping.com
```
1.4.5 Finding the owner of a website
- Use the WHOIS protocol to look up the registered owner of a domain
- Python has a library that wraps this protocol (https://pypi.python.org/pypi/python-whois)
- Install: pip install python-whois
```python
import whois

print(whois.whois('url'))
```
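A slightly fuller sketch, assuming the python-whois package is installed; the domain below is just the registered domain behind the example site, and which fields (emails, name servers, ...) come back depends on the registry, so treat the field names as illustrative.

```python
import whois

# 'python-scraping.com' is used purely as an illustration; any registered domain works
record = whois.whois('python-scraping.com')
print(record)                       # the full parsed WHOIS record
print(record.get('emails'))         # contact emails, if the registry exposes them
print(record.get('name_servers'))   # authoritative name servers, if exposed
```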
1.5 Writing your first web crawler
- Crawling: downloading the web pages that contain the data of interest
- There are many ways to crawl; which one is most suitable depends on the structure of the target site
- Three common approaches to crawling a site:
  - crawling the sitemap
  - iterating over every page via database IDs
  - following page links
1.5.1 Scraping versus crawling
- Scraping: targets a specific website and extracts specified information from it
- Crawling: built in a generic way, targeting a set of websites across top-level domains or the entire web. It can be used to collect more specific information, but more commonly it crawls the web broadly, gathering small, generic pieces of information from different sites or pages and then following links to further pages.
1.5.2 Downloading a web page
1.5.2.1 Retrying downloads
- Temporary errors are common when downloading, e.g. server overload (503 Service Unavailable); wait briefly, then try the download again
- Page not found (404 Not Found) and other problems with the request (4xx): retrying the download has no effect
- Problems on the server side (5xx, e.g. 503 Service Unavailable): the download can be retried
1.5.2.2 Setting a user agent
```python
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


# user_agent='wswp' sets a default user agent
def download(url, num_retries=2, user_agent='wswp'):
    print('Downloading:', url)
    # set the user agent on the request
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent)
    return html
```
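A quick usage check of the function above; the custom user-agent string is just an illustrative value.

```python
# fetch the example site's front page with the default 'wswp' user agent
html = download('http://example.python-scraping.com')

# the retry count and user agent can be overridden per call
html = download('http://example.python-scraping.com', num_retries=1, user_agent='my-crawler')
```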
1.5.3 Sitemap crawler
- Use a regular expression to extract the URLs from the sitemap's `<loc>` tags (the sitemap URL itself can be found in robots.txt)
```python
# import the URL download library
import urllib.request
# import the regular expression library
import re
# import the download error classes
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html


def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here


test_url = 'http://example.python-scraping.com/sitemap.xml'
crawl_sitemap(test_url)
'''
Downloading: http://example.python-scraping.com/sitemap.xml
Downloading: http://example.python-scraping.com/places/default/view/Afghanistan-1
Downloading: http://example.python-scraping.com/places/default/view/Aland-Islands-2
Downloading: http://example.python-scraping.com/places/default/view/Albania-3
Downloading: http://example.python-scraping.com/places/default/view/Algeria-4
Downloading: http://example.python-scraping.com/places/default/view/American-Samoa-5
Downloading: http://example.python-scraping.com/places/default/view/Andorra-6
Downloading: http://example.python-scraping.com/places/default/view/Angola-7
Downloading: http://example.python-scraping.com/places/default/view/Anguilla-8
Downloading: http://example.python-scraping.com/places/default/view/Antarctica-9
Downloading: http://example.python-scraping.com/places/default/view/Antigua-and-Barbuda-10
Downloading: http://example.python-scraping.com/places/default/view/Argentina-11
...
'''
```
1.5.4 ID iteration crawler
```python
import itertools
import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, num_retries=2, user_agent='wswp', charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1, user_agent, charset)
    return html


def crawl_site(url, max_errors=5):
    num_errors = 0
    for page in itertools.count(1):
        pg_url = '{}{}'.format(url, page)
        html = download(pg_url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of consecutive errors, so exit
                break
        else:
            num_errors = 0
            # success - can scrape the result


test_url2 = 'http://example.python-scraping.com/view/-'
# original note: this example still had an unresolved issue to debug
crawl_site(test_url2)
```