Web Crawlers

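What is a web crawler? It is a program that impersonates a browser, downloads a page's HTML source, and then extracts the data it needs from that source. The work generally falls into two steps:
1. Impersonate a browser and download the page (the requests module)
2. Parse the page content into a searchable structure (the BeautifulSoup module)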
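
When requests.get is called without extra headers, requests identifies itself with a python-requests User-Agent rather than a browser one. One common way to look more like a browser is to send a browser-style User-Agent header. The following is a minimal sketch under stated assumptions: the header value and the helper name build_request are hypothetical, and the request is only prepared here, not sent, so the resulting headers can be inspected offline.

```python
import requests

# Hypothetical browser-like headers; the exact User-Agent string is an assumption.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def build_request(url):
    """Prepare (but do not send) a GET request carrying browser-like headers."""
    session = requests.Session()
    # Headers set on the session are merged into every request it prepares.
    session.headers.update(BROWSER_HEADERS)
    return session.prepare_request(requests.Request("GET", url))


prepared = build_request("https://www.autohome.com.cn/news/")
print(prepared.headers["User-Agent"])
```

To actually download a page with the same headers, passing headers=BROWSER_HEADERS to requests.get should have the same effect.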

The requests module

Install: pip install requests
The methods and attributes of this module that are used most often are shown below.

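import requests

# Send a GET request; the returned Response object wraps everything about the reply
response = requests.get(url="https://XXXXXXX")
# Get the HTTP status code
print(response.status_code)
# Get the body decoded as text
print(response.text)
# This text may be garbled: requests picks the encoding from the response headers
# (or guesses it), while this page is actually encoded as GBK
# Set the encoding explicitly
response.encoding = 'gbk'
print(response.text)
# Get the raw body as bytes
print(response.content)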
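
Why the text comes out garbled can be shown without any network access. The sketch below is only an illustration, using a sample string chosen for this example: the same bytes are decoded once with the wrong codec and once with the right one. ISO-8859-1 is used as the wrong codec because, as far as I know, requests falls back to it for text responses whose headers declare no charset.

```python
# Bytes as a GBK-encoded page would deliver them (sample text is an assumption).
raw = "汽车之家".encode("gbk")

# Decoding with the wrong codec raises no error; it silently yields mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page really uses recovers the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

With a real Response object, response.apparent_encoding offers a guess based on the body, which can be assigned to response.encoding when the correct codec is not known in advance.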

The BeautifulSoup module

Install: pip install beautifulsoup4
This module parses the downloaded text into a structure that can be searched.
Commonly used attributes and methods:

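from bs4 import BeautifulSoup

# Pass in the text and parse it with Python's built-in HTML parser; the result is the top-level document object
soup = BeautifulSoup(response.text, 'html.parser')
# Find the div tag whose id is auto-channel-lazyload-article; find returns a Tag object (or None if nothing matches)
div = soup.find(name='div', id='auto-channel-lazyload-article')
# From a Tag you can keep searching its descendants, e.g. its li tags; find_all returns a list of Tag objects
li_list = div.find_all(name='li')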
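
The behaviour of find and find_all can be tried on a small inline document, with no download needed. The HTML below and the id "news" are made up for illustration. The sketch also shows that find returns None when nothing matches, which is why results should be checked before use.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
html = """
<div id="news">
  <ul>
    <li><h3>First title</h3><a href="/a/1.html">link</a></li>
    <li><a href="/a/2.html">no heading here</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag, or None when there is no match.
div = soup.find(name="div", id="news")
missing = soup.find(name="div", id="does-not-exist")

# find_all returns every matching descendant; skip li tags without an h3.
titles = [li.find("h3").text for li in div.find_all("li") if li.find("h3")]
links = [a.get("href") for a in div.find_all("a")]

print(missing, titles, links)
```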

Example 1: Scraping data from Autohome (汽车之家)

import requests
from bs4 import BeautifulSoup

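# Impersonate a browser and download the page
# Send a GET request; the returned Response object wraps everything about the reply
response = requests.get(url="https://www.autohome.com.cn/news/")
# Get the HTTP status code
# print(response.status_code)
# Get the body decoded as text
# print(response.text)
# This text may be garbled: requests picks the encoding from the response headers
# (or guesses it), while this page is actually encoded as GBK
# Set the encoding explicitly
response.encoding = 'gbk'
# print(response.text)
# Get the raw body as bytes
# print(response.content)
# Page download finished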

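# Parsing starts here
# Pass in the text and parse it with Python's built-in HTML parser; the result is the top-level document object
soup = BeautifulSoup(response.text, 'html.parser')
# Find the div tag whose id is auto-channel-lazyload-article; find returns a Tag object (or None if nothing matches)
div = soup.find(name='div', id='auto-channel-lazyload-article')
# From a Tag you can keep searching its descendants, e.g. its li tags; find_all returns a list of Tag objects
li_list = div.find_all(name='li')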
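for li in li_list:
    h3 = li.find(name='h3')
    a = li.find(name='a')
    p = li.find(name='p')
    img = li.find(name='img')
    # Skip entries that lack any of the expected tags; find returns None for those
    if not (h3 and a and p and img):
        continue
    # Get the text of the Tag object
    print(h3.text)
    # Get an attribute
    print(a.get('href'))
    print(p.text)
    # Download the image
    src = img.get('src')
    if not src:
        continue
    # The src is expected to be protocol-relative (//host/path), so add the scheme only when it is missing
    img_url = src if src.startswith('http') else 'https:' + src
    filename = img_url.rsplit('/', maxsplit=1)[1]
    img_res = requests.get(img_url)
    with open(filename, 'wb') as f:
        # The file must be written from the binary content
        f.write(img_res.content)
    # Image download finished
    print('-------------------------------------------------')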
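
The image handling in Example 1 builds the URL by string concatenation and takes the file name with rsplit. An alternative, sketched below using only the standard library, is to resolve the src against the page URL with urljoin and take the file name from the URL path, so that a query string does not end up in the name. The helper names and the img.example.com URLs are hypothetical, and this is not the method used in the original example.

```python
import posixpath
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://www.autohome.com.cn/news/"


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an img src (absolute, protocol-relative, or relative) against the page URL."""
    return urljoin(page_url, src)


def image_filename(url):
    """Take the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical src values covering the three common forms.
for src in ("//img.example.com/pics/car.jpg?x=1",
            "https://img.example.com/a.png",
            "/static/logo.png"):
    url = absolute_image_url(src)
    print(url, image_filename(url))
```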

Source: https://www.cnblogs.com/Treelight/p/12271711.html
