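Introduction to Beautiful Soup
Beautiful Soup is a Python library whose main purpose is extracting data from web pages. The official description reads as follows:
1. Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.
3. Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.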
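The encoding behaviour described in point 2 can be sketched as follows. This is a minimal illustration, not part of the official description: the sample markup is made up, and html.parser is used only so that the sketch runs without a third-party parser. The from_encoding argument is how the original encoding is stated when the document does not declare one.

```python
from bs4 import BeautifulSoup

# Made-up sample: Latin-1 bytes with no charset declaration in the markup.
raw = "<p>café</p>".encode("latin-1")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string                 # inside the tree, text is Unicode
utf8_bytes = soup.p.encode("utf-8")  # serialising the tag to bytes as UTF-8

print(text)
print(utf8_bytes)
```

After parsing, soup.original_encoding reports the encoding Beautiful Soup ended up using.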
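Advantages and disadvantages of each parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to the standard-library parser. The lxml parser is more capable and faster, so installing it is recommended (pip install lxml).
Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5 output; pure Python, no C extension needed | Slow |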
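To see why the choice of parser matters, the sketch below feeds the same invalid fragment to each parser in the table and records the resulting tree. It assumes nothing about which third-party parsers are installed: any parser that is missing is recorded as None. The fragment `<a></p>` is the one used in the official documentation's comparison of parsers.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid markup: a stray closing tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, tree in results.items():
    print(parser, "->", tree)
```

The standard-library parser simply drops the stray `</p>`, while lxml and html5lib, when installed, wrap the fragment in a fuller document skeleton, so code that walks the tree can see different structures depending on the parser.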
BeautifulSoup
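Overview
1. Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.
2. First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
3. Beautiful Soup then parses the document with the best available parser; if you name a parser explicitly, it uses that one instead.
Example 1:
from bs4 import BeautifulSoup  # import the bs4 library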
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser and create the BeautifulSoup object
print(type(soup))
HTML_prettify = soup.prettify()  # format the contents of the soup object for printing
print(HTML_prettify)
"""
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
"""
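We can also create the object from a local HTML file, for example:
Example 1_1:
from bs4 import BeautifulSoup
# Open the local file and build the soup object from it; the with block closes the file afterwards
with open("F:\\test.txt", "r", encoding="utf-8") as html:
    soup = BeautifulSoup(html, features="lxml")
Notes:
1. BeautifulSoup() returns a bs4.BeautifulSoup object; we call methods on this object to pull the data we need out of the HTML.
2. It is best to name the parser explicitly when calling BeautifulSoup() (here, lxml). Otherwise Beautiful Soup emits a warning and picks whichever parser happens to be installed, so the same code can behave differently on another machine.
3. The example above uses prettify(), which returns the parsed content as a neatly indented string. It is used often, so it is worth remembering.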
The four kinds of objects
Beautiful Soup turns a complex HTML document into a tree structure. Every node is a Python object, and all of them fall into four kinds:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
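Tag
1. In HTML, a tag is a keyword enclosed in angle brackets. Tags usually come in pairs, such as <p> and </p>.
2. In a pair, the first tag (without the "/") is the start tag (also called the opening tag), and the second is the end tag (also called the closing tag).
3. For example:
<title>The Dormouse's story</title> or <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
An HTML tag such as title or a, together with everything it contains, is a Tag. In Beautiful Soup you can reach these tags simply by appending the tag name to the object returned by BeautifulSoup().
Example 2:
from bs4 import BeautifulSoup  # import the bs4 library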
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser
HTML_title = soup.title
print(HTML_title)
HTML_head = soup.head
print(HTML_head)
HTML_a = soup.a
print(HTML_a)
HTML_p = soup.p
print(HTML_p)
print(type(HTML_p))
"""
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>
"""
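Notes:
1. As the example shows, appending a tag name to the soup object retrieves that tag together with its contents.
2. Note, however, that this returns only the first matching tag in the whole document. Retrieving every matching tag is covered later.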
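As a brief preview of the difference noted above, the sketch below contrasts soup.a with find_all("a"). The markup is a shortened version of the sample document, and html.parser is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a            # only the first <a> in the document
all_a = soup.find_all("a")  # every <a>, in document order

first_id = first_a["id"]
all_ids = [a["id"] for a in all_a]

print(first_id)
print(all_ids)
```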
Tag properties: name and attrs
name: the name of the tag
Example 3:
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser
soup_tag_name = soup.name
title_tag_name = soup.title.name
head_tag_name = soup.head.name
a_tag_name = soup.a.name
print(soup_tag_name)
print(title_tag_name)
print(head_tag_name)
print(a_tag_name)
"""
[document]
title
head
a
"""
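Notes:
1. The soup object itself is a special case: its name is [document].
2. For any other tag, name is the tag's own name.
attrs: the attributes of the tag
The attrs property returns the attributes of a tag as a dictionary.
Example 4:
from bs4 import BeautifulSoup  # import the bs4 library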
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser
soup_tag_attrs = soup.attrs
title_tag_attrs = soup.title.attrs
head_tag_attrs = soup.head.attrs
a_tag_attrs = soup.a.attrs
print(soup_tag_attrs)
print(title_tag_attrs)
print(head_tag_attrs)
print(a_tag_attrs)
"""
{}
{}
{}
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
"""
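Notes:
1. The output shows that attrs is an empty dictionary for the soup object, the head tag, and the title tag, because none of them carries any attributes (key: value pairs).
2. For a tag that does have attributes, attrs returns all of them as a dictionary. The order of the keys in the printed dictionary can vary with the Python version.
3. To get the value of one particular attribute, any of the following three approaches works:
(1) Index the dictionary: attrs is a dictionary, so ordinary dictionary access applies.
(2) Index the tag directly: soup.<tag name>["attribute name"]
(3) Call get on the tag: soup.<tag name>.get("attribute name")
Example 5:
from bs4 import BeautifulSoup  # import the bs4 library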
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser
a_tag_attrs = soup.a.attrs
print(a_tag_attrs)
a_tag_attrs_href_dict = a_tag_attrs["href"]  # index the attrs dictionary
print(a_tag_attrs_href_dict)
a_tag_attrs_href = soup.a["href"]  # index the tag directly: soup.<tag name>["attribute name"]
print(a_tag_attrs_href)
a_tag_attrs_href_get = soup.a.get("href")  # soup.<tag name>.get("attribute name")
print(a_tag_attrs_href_get)
"""
{'href': 'http://example.com/elsie', 'id': 'link1', 'class': ['sister']}
http://example.com/elsie
http://example.com/elsie
http://example.com/elsie
"""
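Notes:
1. The example shows three ways to get the value of a specific attribute:
(1) Going through the attrs dictionary takes one more step than the other two, which makes it slightly more cumbersome.
(2) When indexing the tag directly, remember to put the attribute name inside square brackets.
(3) Calling get with the attribute name returns the same value as the second approach when the attribute exists. When it is missing, get returns None, whereas square brackets raise KeyError.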
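The difference between square brackets and get only shows up when the attribute is missing. The sketch below illustrates it on a one-tag document; the title attribute and the "no title" fallback are made up for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', "html.parser"
)
a = soup.a

present = a.get("href")               # attribute exists: same value as a["href"]
missing = a.get("title")              # attribute missing: None
fallback = a.get("title", "no title") # optional default, as with dict.get

try:
    a["title"]                        # attribute missing: raises KeyError
    raised = False
except KeyError:
    raised = True

print(present, missing, fallback, raised)
```

In scraping code, get is usually the safer choice for optional attributes, while square brackets are useful when a missing attribute should be treated as an error.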
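Multi-valued attributes
HTML 4 defines a number of attributes that can hold multiple values. HTML5 removes some of them but adds more. The most common multi-valued attribute is class (a tag can have several CSS classes); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list.
Example 5_2: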
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p class="body strikeout"></p>',"lxml")
print(css_soup.p['class'])
css_soup = BeautifulSoup('<p class="body"></p>',"lxml")
print(css_soup.p['class'])
"""
['body', 'strikeout']
['body']
"""
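Notes:
1. If an attribute looks as if it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.
2. When a tag is converted back to a string, the values of a multi-valued attribute are joined into one value.
3. If the document is parsed as XML, no attribute is treated as multi-valued.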
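Note 2 can be checked with a short sketch. It uses html.parser so that it runs without lxml, and it also shows get_attribute_list, which always returns a list regardless of whether the attribute is multi-valued; treat the sample markup as an illustration only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]                  # multi-valued attribute: a list
rendered = str(p)                     # the list is joined back into one value
id_value = p["id"]                    # id is not multi-valued: a plain string
id_list = p.get_attribute_list("id")  # always a list, whatever the attribute

print(classes)
print(rendered)
print(id_value, id_list)
```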
Example 5_3:
from bs4 import BeautifulSoup
css_soup = BeautifulSoup('<p id="my id"></p>',"lxml")
print(css_soup.p['id'])
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>',"lxml")
print(rel_soup.a['rel'])
# When the document is parsed as XML, no attribute is treated as multi-valued
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
print(xml_soup.p['class'])
"""
my id
['index']
body strikeout
"""
NavigableString
1. Strings usually sit inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class.
2. Once we have a tag, how do we get the text inside it? Use .string.
Example 6:
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser and create the BeautifulSoup object
head_string = soup.head.string
p_string = soup.p.string
a_string = soup.a.string
print(head_string)
print(type(head_string))
print(p_string)
print(a_string)
"""
The Dormouse's story
<class 'bs4.element.NavigableString'>
The Dormouse's story
Elsie
"""
Note: the value is a NavigableString, that is, a string that can be navigated.
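One detail worth knowing about .string is that it only returns a value when the tag leads to exactly one string; if a tag has several children, .string is None. The sketch below, on made-up markup and with html.parser so that it runs without lxml, shows this together with stripped_strings as one way to collect the individual pieces.

```python
from bs4 import BeautifulSoup

html = (
    "<p><b>The Dormouse's story</b></p>"
    "<p>Once upon a time <a>Lacie</a> and <a>Tillie</a></p>"
)
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string     # one string underneath: a NavigableString
plain = str(single)         # convert to an ordinary Python str
multiple = second_p.string  # several children: None
pieces = list(second_p.stripped_strings)  # each string, whitespace-trimmed

print(plain)
print(multiple)
print(pieces)
```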
BeautifulSoup
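The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object; it is a special kind of Tag. We can look at its name and attributes to get a feel for it.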
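That the BeautifulSoup object behaves like a Tag can be checked directly. The sketch below is a small illustration on made-up markup, using html.parser so that it runs without lxml: the same search method works on the soup object and on an ordinary tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

is_tag = isinstance(soup, Tag)                    # the soup object is itself a Tag
title_via_soup = soup.find("title").string        # search from the whole document
title_via_tag = soup.head.find("title").string    # same search from a normal tag

print(is_tag)
print(title_via_soup == title_via_tag)
```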
Example 7:
from bs4 import BeautifulSoup  # import the bs4 library
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser and create the BeautifulSoup object
soup_name = soup.name
print(soup_name)
print(type(soup_name))
soup_attrs = soup.attrs
print(soup_attrs)
"""
[document]
<class 'str'>
{}
"""
Comment
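A Comment object is a special type of NavigableString. Its printed value does not include the comment delimiters, so if it is not handled carefully it can cause unexpected problems in text processing.
Example 8: find a tag that contains a comment
from bs4 import BeautifulSoup  # import the bs4 library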
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "lxml")  # name the parser and create the BeautifulSoup object
a_string = soup.a.string
print(soup.a)
print(a_string)
print(type(a_string))
"""
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>
"""
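Notes:
1. In the example above, the content of the a tag is actually a comment, yet .string prints it with the comment delimiters stripped, so it looks just like ordinary text. This can cause unnecessary trouble.
2. To leave comment content out, use get_text() or .text: a_string = soup.a.get_text()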
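Another way to guard against comments is to check the type of the string before using it. The sketch below is one possible approach, not something taken from the official documentation: visible_string is a hypothetical helper name, the markup is a shortened version of the sample document, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Return tag.string as a str, or None if it is missing or a comment.

    Hypothetical helper written for this illustration.
    """
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


first, second = soup.find_all("a")

print(visible_string(first))   # the comment is filtered out
print(visible_string(second))  # ordinary text is returned
print(repr(first.get_text()))  # get_text() also leaves the comment out
```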
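Note:
This article was written as study notes while working through the Beautiful Soup 4.2.0 documentation. It is meant only for my own later study and reference, and it certainly contains errors and omissions, so if you happen to read it, please consult the official documentation directly:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#string
Source: CSDN
Author: 不怕猫的耗子A
Link: https://blog.csdn.net/qq_39314932/article/details/99338957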