Python Web Scraping: The BeautifulSoup Library

Introduction to Beautiful Soup

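Beautiful Soup is a Python library whose main purpose is to extract data from web pages. The official description reads as follows:

1. Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.

2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.

3. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, letting you try out different parsing strategies or trade speed for flexibility.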
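The encoding behaviour in point 2 can be sketched as follows. This is only an illustration: the GBK sample bytes are made up for the demo, and html.parser is used so that no third-party parser is needed. The from_encoding argument is how the original encoding is stated when the document does not declare one.

```python
from bs4 import BeautifulSoup

# Bytes encoded as GBK, with no <meta charset> declaration in the markup.
raw = "<p>你好</p>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                  # already a Unicode string
utf8_bytes = soup.p.encode("utf-8")   # the tag re-encoded as UTF-8 bytes

print(text)
print(soup.original_encoding)         # the encoding Beautiful Soup decoded from
print(utf8_bytes)
```

If from_encoding is left out, Beautiful Soup tries to guess the encoding, which can fail on short or ambiguous input like this.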

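Parser Pros and Cons

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended (pip install lxml).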

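Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed markup	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package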
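The table's differences in leniency are easiest to see on invalid markup. The sketch below feeds the same broken snippet to each parser and prints what comes back; the snippet and the results dictionary are just for illustration, and a parser that is not installed is reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: a closing </p> with no matching opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} {output}")
```

Each parser repairs the document in its own way, which is one reason to name the parser explicitly in scraping code.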

BeautifulSoup

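Overview

1. Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.

2. First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

3. Beautiful Soup then picks the most suitable parser to parse the document; if you specify a parser explicitly, Beautiful Soup uses that one instead.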
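Points 1 and 2 can be sketched in a few lines. The markup strings are invented for the demo, io.StringIO stands in for a real file opened with open(), and html.parser is chosen only to avoid extra dependencies.

```python
import io
from bs4 import BeautifulSoup

# 1) A string: the entity &eacute; becomes the Unicode character é.
soup_from_str = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")
print(soup_from_str.p.string)

# 2) A file handle: any object with a read() method is accepted.
handle = io.StringIO("<p>caf&eacute;</p>")
soup_from_file = BeautifulSoup(handle, "html.parser")
print(soup_from_file.p.string)
```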

Example 1:

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object

print(type(soup))

HTML_prettify = soup.prettify()  # format the soup object's content for printing
print(HTML_prettify)


"""
<class 'bs4.BeautifulSoup'>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""

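You can also create the object from a local HTML file, for example:
Example 1_1:

from bs4 import BeautifulSoup

# open the local file and build the soup object from the file handle
with open("F:\\test.txt", "r", encoding="utf-8") as html:
    soup = BeautifulSoup(html, features="lxml")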

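Notes:
1. BeautifulSoup() returns a bs4.BeautifulSoup object, whose methods and attributes give access to the data we need from the HTML.

2. It is best to name the parser explicitly when calling BeautifulSoup() (lxml is used here). Otherwise Beautiful Soup issues a warning and picks the best parser installed, so the same code may behave differently on another machine.

3. The example above uses prettify(), which returns the document as a nicely indented string. It is used often and is worth remembering.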

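The Four Kinds of Objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds:
1. Tag

2. NavigableString

3. BeautifulSoup

4. Comment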
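As a quick orientation before looking at each kind in turn, the sketch below builds a tiny document (made up for this demo, parsed with html.parser) and prints the class of one node of each kind.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<b>bold</b><i><!-- note --></i>", "html.parser")

kinds = {
    "soup": type(soup).__name__,
    "soup.b": type(soup.b).__name__,
    "soup.b.string": type(soup.b.string).__name__,
    "soup.i.string": type(soup.i.string).__name__,
}
for expr, kind in kinds.items():
    print(f"{expr:14} -> {kind}")

# BeautifulSoup is a subclass of Tag; Comment is a subclass of NavigableString.
print(isinstance(soup, Tag))
print(isinstance(soup.i.string, Comment) and isinstance(soup.i.string, NavigableString))
```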

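Tag

1. In HTML, a tag is a keyword enclosed in angle brackets. Tags usually come in pairs, such as <p> and </p>.

2. In a pair, the first tag (without "/") is the start tag (also called the opening tag) and the second is the end tag (also called the closing tag).

3. For example:

<title>The Dormouse's story</title> or <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

An HTML tag such as title or a, together with everything it contains, is a Tag. In Beautiful Soup you can get one simply by writing the tag name as an attribute of the object returned by BeautifulSoup().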

Example 2:

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
HTML_title = soup.title
print(HTML_title)

HTML_head = soup.head
print(HTML_head)

HTML_a = soup.a
print(HTML_a)

HTML_p = soup.p
print(HTML_p)
print(type(HTML_p))

"""
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>
"""

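Notes:
1. As the example shows, writing the tag name as an attribute of the soup object is an easy way to get that tag.

2. Note, however, that this returns only the first matching tag in the document. Finding all matching tags is covered later.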
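For a first impression of the difference, here is a minimal sketch that contrasts soup.a with find_all("a"). The shortened markup is adapted from the examples above, and html.parser is used only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # only the first <a> in the document
all_links = soup.find_all("a")   # every <a> in the document

ids = [a["id"] for a in all_links]
print(first_link["id"])
print(ids)
```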

Tag attributes: name and attrs

name: the name of the tag

Example 3:

from bs4 import BeautifulSoup  # import the bs4 library

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
soup_tag_name = soup.name
title_tag_name = soup.title.name
head_tag_name = soup.head.name
a_tag_name = soup.a.name

print(soup_tag_name)
print(title_tag_name)
print(head_tag_name)
print(a_tag_name)

"""
[document]
title
head
a
"""

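Notes:
1. The soup object itself is a special case: its name is [document].

2. For every other tag, the value is the name of the tag itself.

attrs: the attributes of the tag
The attrs attribute returns the tag's attributes as a dictionary.
Example 4:

from bs4 import BeautifulSoup  # import the bs4 library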

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser
soup_tag_attrs = soup.attrs
title_tag_attrs = soup.title.attrs
head_tag_attrs = soup.head.attrs
a_tag_attrs = soup.a.attrs

print(soup_tag_attrs)
print(title_tag_attrs)
print(head_tag_attrs)
print(a_tag_attrs)

"""
{}
{}
{}
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
"""

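Notes:
1. The output shows that the soup object, the head tag, and the title tag all return an empty dictionary, because none of them has any attributes (key: value pairs).

2. For a tag that does have attributes, attrs returns all of them as a dictionary.

3. To get the value of a single attribute, use any of these three approaches:
⑴ Index the dictionary: attrs returns a dictionary, so ordinary dictionary indexing works.
⑵ Use soup.tag_name["attribute_name"].
⑶ Use soup.tag_name.get("attribute_name").

Example 5:

from bs4 import BeautifulSoup  # import the bs4 library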

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser

a_tag_attrs = soup.a.attrs
print(a_tag_attrs)

a_tag_attrs_href_dict = a_tag_attrs["href"]  # index the attrs dictionary
print(a_tag_attrs_href_dict)

a_tag_attrs_href = soup.a["href"]  # index the tag directly: soup.tag_name["attribute_name"]
print(a_tag_attrs_href)

a_tag_attrs_href_get = soup.a.get("href")  # use soup.tag_name.get("attribute_name")
print(a_tag_attrs_href_get)

"""
{'href': 'http://example.com/elsie', 'id': 'link1', 'class': ['sister']}
http://example.com/elsie
http://example.com/elsie
http://example.com/elsie
"""

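Notes:
1. The example shows three ways to get the value of a specific attribute:
   ⑴ Going through the attrs dictionary takes one more step than the other two, so it is slightly more verbose.
   ⑵ With soup.tag_name["attribute_name"], the attribute name must go inside square brackets; a missing attribute raises KeyError.
   ⑶ get() takes the attribute name and returns the same value as the second approach, except that it returns None when the attribute is missing.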
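The one practical difference between the second and third approaches shows up when the attribute is missing. A small sketch, using a one-tag document and a title attribute that deliberately does not exist (both made up for the demo):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
)
tag = soup.a

print(tag["href"])              # attribute present: both forms agree
print(tag.get("href"))

print(tag.get("title"))         # attribute missing: get() returns None
print(tag.get("title", "n/a"))  # or a fallback value, like dict.get()

try:
    tag["title"]                # attribute missing: indexing raises KeyError
except KeyError as exc:
    missing = exc.args[0]
    print("KeyError:", missing)
```

get() is usually the safer choice when scraping pages where an attribute may or may not be present.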

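Multi-valued attributes
HTML 4 defines a few attributes that can hold multiple values. HTML5 removes some of them but defines more. The most common multi-valued attribute is class (a tag can have more than one CSS class); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list.
Example 5_2: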

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>',"lxml")
print(css_soup.p['class'])


css_soup = BeautifulSoup('<p class="body"></p>',"lxml")
print(css_soup.p['class'])

"""
['body', 'strikeout']
['body']
"""

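Notes:
1. If an attribute looks like it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.
2. When a tag is converted back to a string, the values of a multi-valued attribute are joined into one.
3. If the document is parsed as XML, no attribute is treated as multi-valued.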
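Note 2 can be sketched as follows: a list is assigned to rel, and the tag is then converted back to a string. The markup follows the style of the official documentation, and html.parser is used here only to avoid needing lxml.

```python
from bs4 import BeautifulSoup

rel_soup = BeautifulSoup(
    '<p>Back to the <a rel="index">homepage</a></p>', "html.parser"
)

# Assign a list to the multi-valued attribute rel.
rel_soup.a["rel"] = ["index", "contents"]

# Converting the tag to a string joins the values with a space.
merged = str(rel_soup.p)
print(merged)
```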
Example 5_3:

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p id="my id"></p>',"lxml")
print(css_soup.p['id'])


rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>',"lxml")
print(rel_soup.a['rel'])


# when the document is parsed as XML, no attribute is treated as multi-valued
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
print(xml_soup.p['class'])

"""
my id
['index']
body strikeout
"""

NavigableString

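1. Strings usually sit inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class.
2. Now that we can get a tag, how do we get the text inside it? Simply use .string, which works when the tag contains a single string.
Example 6:

from bs4 import BeautifulSoup  # import the bs4 library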

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object

head_string = soup.head.string
p_string = soup.p.string
a_string = soup.a.string

print(head_string)
print(type(head_string))
print(p_string)
print(a_string)

"""
The Dormouse's story
<class 'bs4.element.NavigableString'>
The Dormouse's story
 Elsie
"""

Note: the type is NavigableString, which literally means "navigable string".
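One thing worth knowing about .string is that it only returns text when the tag has a single string inside it. The sketch below, on a shortened markup invented for the demo, shows that case, the conversion to a plain str, and what happens when a tag has several children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <a>Lacie</a> and <a>Tillie</a></p>", "html.parser")

single = soup.a.string       # <a> holds exactly one string
plain = str(single)          # convert to an ordinary Python str
mixed = soup.p.string        # <p> has several children, so this is None
pieces = list(soup.p.stripped_strings)  # every string under <p>, whitespace stripped

print(type(single).__name__, repr(plain))
print(mixed)
print(pieces)
```

When a tag mixes text and child tags, .strings, .stripped_strings, or get_text() are the usual ways to collect all of the text.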

BeautifulSoup

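The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object; it is a special kind of Tag. Let's look at its name, the type of that name, and its attributes to get a feel for it.
Example 7:

from bs4 import BeautifulSoup  # import the bs4 library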

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object

soup_name = soup.name
print(soup_name)
print(type(soup_name))

soup_attrs = soup.attrs
print(soup_attrs)

"""
[document]
<class 'str'>
{}
"""

Comment

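A Comment object is a special type of NavigableString. When printed, its content does not include the comment markers, so if it is not handled carefully it can cause unexpected trouble in text processing.

Example 8: find a tag that contains a comment

from bs4 import BeautifulSoup  # import the bs4 library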

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "lxml")  # specify the parser and create the BeautifulSoup object

a_string = soup.a.string
print(soup.a)
print(a_string)
print(type(a_string))

"""
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
"""

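Notes:
1. In the example above, the content of the a tag is actually a comment, but printing it through .string shows it with the comment markers stripped, so it can easily be mistaken for ordinary text.

2. To ignore comment content, use get_text() or .text instead: a_string = soup.a.get_text()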
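Both notes can be checked with a short sketch: an isinstance test against Comment tells a comment apart from ordinary text, and get_text() leaves the comment out. The one-tag markup is taken from the example above; html.parser is used only to keep the snippet free of extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

content = soup.a.string
is_comment = isinstance(content, Comment)   # True: this is a comment, not text
visible_text = soup.a.get_text()            # comments are skipped here

print(repr(content))
print(is_comment)
print(repr(visible_text))
```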

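Note:

These notes were written while working through the Beautiful Soup 4.2.0 documentation, mainly for my own future study and reference. They are bound to contain mistakes or omissions, so if you happen to read this, please refer to the official documentation directly: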

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#string

Source: CSDN

Author: 不怕猫的耗子A

Link: https://blog.csdn.net/qq_39314932/article/details/99338957
