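2.1 Learning BeautifulSoup

Learn BeautifulSoup and use it to extract content.
Use BeautifulSoup to extract the reply content from the DXY (丁香园) forum.

2.2 Learning XPath

Learn XPath and use lxml + XPath to extract content.
Use XPath to extract the reply content from the DXY (丁香园) forum.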

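I. Learning BeautifulSoup:

1. Introduction:

BeautifulSoup is a Python library for parsing HTML and XML, used to extract data from web pages.

BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.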
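The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original notes: the Latin-1 sample bytes are made up, and the standard-library 'html.parser' is used (an assumption) so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as Latin-1. from_encoding tells
# Beautiful Soup how to decode them.
raw = "<p>café</p>".encode("latin-1")
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string        # text inside the tree is a Unicode str
encoded = soup.p.encode()   # encode() produces UTF-8 bytes by default

print(repr(text))
print(encoded)
```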

To import BeautifulSoup: from bs4 import BeautifulSoup

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

2. Parsers:

The lxml parser is recommended. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object.

e.g.:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'lxml')

3. Basic usage:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

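BeautifulSoup automatically repairs malformed markup. The HTML given above is incomplete: neither the html tag nor the body tag is closed. The markup is corrected when the BeautifulSoup object is created, and the missing closing tags for html and body are added.
prettify(): outputs the parsed document as a string in a standard indented format
soup.title: selects the title node in the HTML
soup.title.string: the string attribute returns the text inside the node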
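The automatic repair mentioned above can be seen on a much smaller fragment. This is a minimal sketch with a made-up input string; it uses the standard-library 'html.parser' (an assumption, so that it runs without lxml). Unlike lxml, html.parser closes the open tags but does not wrap the fragment in html and body tags.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is closed.
broken = "<p>Hello <b>world"
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags that were added.
repaired = str(soup)
print(repaired)
```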

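Selectors: BeautifulSoup offers three kinds of selectors: 1) node selectors 2) method selectors 3) CSS selectors

4. Node selectors:

4.1 Selecting elements:

BeautifulSoup object.node name

e.g.: soup.title

The result is the node together with all of its content: <title>The Dormouse's story</title>
When a matching node exists, the result is of type bs4.element.Tag (if no such node exists, the result is None). A Tag has the attributes name, attrs, and string, which can be accessed directly.
Note: when several nodes share the same name, this form of selection returns only the first match; all later nodes are ignored.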

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

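4.2 Tag node attributes:

A Tag object has three main attributes: name, attrs, and string.

1) name:

Returns the name of the node.

Select the node first, then read its name attribute.

print(soup.title.name)

Output: title

Here the title node is selected first, and its name attribute gives the node name: title

2) attrs:

Returns the attribute values of the node.

attrs returns all attributes of the node as a dictionary: {'attr1': 'value1', 'attr2': 'value2'}
attrs['attribute name'] returns the value of that attribute, e.g. attrs['attr1']
node['attribute name'] also returns the value of that attribute.
Note: some values are returned as strings and others as lists of strings. Multi-valued attributes such as class are returned as lists.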

print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p['name'])
print(soup.p['class'])

Output:

{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
['title']

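3) string:

Returns the text contained in the node element.

print(soup.title.string)

Output: The Dormouse's story

Note: string is only useful when the node contains just text (no child tags) or a single child tag. When the node has several children, it is ambiguous which child's content .string should return, so the result is None.

To extract the text inside a node, the get_text() method is recommended over the string attribute.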
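The difference between string and get_text() can be sketched as follows. The HTML snippet is made up for illustration, and 'html.parser' is used (an assumption) so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the <p> holds text mixed with two <a> tags.
html = '<p class="story">Once <a href="#">Lacie</a> and <a href="#">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string          # <a> has exactly one text child
multiple = soup.p.string        # <p> has several children, so this is None
full_text = soup.p.get_text()   # concatenates all text below <p>

print(single)
print(multiple)
print(full_text)
```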

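4.3 Nested selection:

Select a node element inside another node element. Because each selection returns a Tag, selections can be chained.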

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head)
print(type(soup.head))
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<head><title>The Dormouse's story</title></head>
<class 'bs4.element.Tag'>
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

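4.4 Associated selection:

Starting from a reference node, find its children, descendants, parent, ancestors, and siblings.

1) Children and descendants:

contents

Returns the direct children (both text and nodes) as a list.

children

Returns the direct children (both text and nodes) as an iterator, which can be traversed with a for loop.

descendants

Returns all descendants (both text and nodes) as a generator.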
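The three attributes above can be compared on a small tree. This is a minimal sketch with a made-up list snippet, using 'html.parser' (an assumption) so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with two levels of nesting.
html = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

contents = ul.contents                                # list of direct children
child_names = [child.name for child in ul.children]   # iterate direct children

# descendants walks the whole subtree, yielding tags and text nodes.
descendant_tags = [n.name for n in ul.descendants if isinstance(n, Tag)]
descendant_text = [str(n) for n in ul.descendants if not isinstance(n, Tag)]

print(contents)
print(child_names)
print(descendant_tags)
print(descendant_text)
```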

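2) Parent and ancestors (direct parent, grandparent, great-grandparent, ...):

parent

Returns the direct parent (the parent node with all of its content); it does not continue outward to the parent's own ancestors.

parents

Returns all ancestors (direct parent, grandparent, great-grandparent, ...) as a generator.

Note: besides the ancestor tags, the generator ends with the whole document: its last element is the BeautifulSoup object itself.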
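A short sketch of parent and parents, including the final BeautifulSoup object noted above. The snippet is made up, and 'html.parser' is an assumption so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, fully closed document.
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

parent_name = b.parent.name                    # direct parent only
ancestor_names = [a.name for a in b.parents]   # walks outward to the document
last_ancestor = list(b.parents)[-1]            # the BeautifulSoup object

print(parent_name)
print(ancestor_names)
print(type(last_ancestor))
```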

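3) Siblings (nodes at the same level):

next_sibling

Returns the next sibling of the reference node.

previous_sibling

Returns the previous sibling of the reference node.

next_siblings

Returns all siblings after the reference node.

previous_siblings

Returns all siblings before the reference node.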
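The sibling attributes can be sketched on three adjacent links. The snippet is made up and deliberately contains no whitespace between the tags; in real pages the whitespace between tags is itself a text node and counts as a sibling. 'html.parser' is an assumption so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three <a> tags with nothing between them.
html = ('<p><a id="link1">Elsie</a><a id="link2">Lacie</a>'
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
middle = soup.find(id="link2")
last = soup.find(id="link3")

next_id = middle.next_sibling["id"]
previous_id = middle.previous_sibling["id"]
after_first = [s["id"] for s in first.next_siblings]
before_last = [s["id"] for s in last.previous_siblings]  # nearest first

print(next_id, previous_id, after_first, before_last)
```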

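5. Method selectors:

5.1 find_all(name, attrs, recursive, text, **kwargs)

Finds all elements that match the given conditions. The result is a list (a ResultSet); when searching by name or attributes, each element is a bs4.element.Tag.

Nested queries are supported.

Parameters:

name

Queries elements by node name, e.g. a, p, ul, li, div, title.

attrs

Queries elements by node attributes.

Note: the attrs argument must be a dictionary.

text

Queries by the text inside nodes. The argument can be a string or a compiled regular expression object. In Beautiful Soup 4.4.0 and later this argument is also available under the name string.

5.2 find()

Same as find_all(), except that it returns a single element: the first match, or None if nothing matches.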
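The parameters of find_all() and the single-result behaviour of find() can be sketched as follows. The snippet is a shortened, made-up variant of the sample document; 'html.parser' is an assumption so it runs without lxml, and the string keyword is used for the text search, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet based on the sample document.
html = """<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<p class="story">...</p>"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all(name="a")                         # by node name
by_attrs = soup.find_all(attrs={"id": "link2"})         # by attribute dict
by_text = soup.find_all(string=re.compile("Dormouse"))  # by text (regex)
first_link = soup.find(name="a")                        # first match only
missing = soup.find(name="table")                       # no match

print(links)
print(by_attrs)
print(by_text)
print(first_link)
print(missing)
```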

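5.3 find_parents(), find_parent():

The former returns all matching ancestor nodes; the latter returns the nearest matching ancestor (the direct parent when no filter is given).

5.4 find_next_siblings(), find_next_sibling():

The former returns all following siblings that match; the latter returns the first following sibling that matches.

5.5 find_previous_siblings(), find_previous_sibling():

The former returns all preceding siblings that match; the latter returns the first preceding sibling that matches.

5.6 find_all_next(), find_next():

The former returns all matching nodes after the current node; the latter returns the first matching node after the current node.

5.7 find_all_previous(), find_previous():

The former returns all matching nodes before the current node; the latter returns the first matching node before the current node.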
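The methods in 5.3 to 5.7 can be sketched together on a small made-up snippet. 'html.parser' is an assumption so the example runs without lxml; the ids are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three paragraphs inside one div.
html = ('<div><p id="first">one</p><p id="second">two</p>'
        '<p id="third">three</p></div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="first")
second = soup.find(id="second")
third = soup.find(id="third")

parent_name = second.find_parent("div").name
parent_names = [t.name for t in second.find_parents("div")]
next_sibling_id = second.find_next_sibling("p")["id"]
next_sibling_ids = [p["id"] for p in first.find_next_siblings("p")]
previous_sibling_id = second.find_previous_sibling("p")["id"]
next_id = second.find_next("p")["id"]
all_previous_ids = [p["id"] for p in third.find_all_previous("p")]  # nearest first

print(parent_name, parent_names)
print(next_sibling_id, next_sibling_ids, previous_sibling_id)
print(next_id, all_previous_ids)
```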

6. CSS selectors:
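The original notes leave this section empty. As a minimal sketch only: Beautiful Soup exposes CSS selection through select() and select_one(). The snippet and selectors below are made up for illustration, and 'html.parser' is an assumption so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = """<div id="main"><p class="story">A
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a></p></div>"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.select("p.story a.sister")   # all matches, as a list of Tags
first = soup.select_one("#link1")           # first match only
hrefs = [a["href"] for a in sisters]

print(hrefs)
print(first.get_text())
```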

7. Scraping the reply content of a DXY thread

# Use BeautifulSoup to extract the reply content from the DXY forum.
# DXY thread: http://www.dxy.cn/bbs/thread/626626#626626

import requests
from bs4 import BeautifulSoup

class dingxiangyuan():
    # 1. Send the request
    def send_request(self):
        # 1.1 Add a request header
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
        # 1.2 URL
        url = 'http://www.dxy.cn/bbs/thread/626626#626626'
        # Send the request
        response = requests.get(url=url, headers=headers)
        return response
    # 2. Parse the data
    def parse(self, response):
        # Read the response data
        response_data = response.content.decode()
        # Initialize BeautifulSoup and parse the response data with the lxml parser
        bsoup = BeautifulSoup(response_data, 'lxml')
        # Get all reply nodes
        replies = bsoup.find_all(name='td', attrs={'class': 'postbody'})
        for reply in replies:
            reply_content = reply.get_text().strip()
            print(reply_content)
    # 3. Store the data (not implemented)
    # 4. Run
    def run(self):
        response = self.send_request()
        self.parse(response)

dingxiangyuan().run()
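The parsing step can also be tried without network access. The sketch below is not the original author's code: extract_replies is a hypothetical helper, and sample_html is a made-up stand-in that only assumes, as the script above does, that each reply sits in a td with class postbody. 'html.parser' is used (an assumption) so it runs without lxml.

```python
from bs4 import BeautifulSoup

def extract_replies(html):
    """Return the stripped text of every <td class="postbody"> cell."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all(name="td", attrs={"class": "postbody"})
    return [cell.get_text().strip() for cell in cells]

# Hypothetical markup standing in for a downloaded thread page.
sample_html = """
<table>
  <tr><td class="postbody"> First reply </td></tr>
  <tr><td class="other">Not a reply</td></tr>
  <tr><td class="postbody">Second <b>reply</b></td></tr>
</table>
"""

replies = extract_replies(sample_html)
print(replies)
```

Returning a list instead of printing inside the loop keeps the parsing logic separate from output, which would make the unimplemented storage step easier to add.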

Source: https://www.cnblogs.com/tommyngx/p/11319551.html
