Learning Beautiful Soup, a web page parsing tool

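Beautiful Soup is a Python library for parsing HTML and XML. It parses a document the way you want and lets you search and modify the resulting parse tree. It copes well with malformed markup and still produces a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree.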

Installing Beautiful Soup

# Install version 3
apt-get install python-beautifulsoup
# Install version 4
apt-get install python-bs4 python-bs4-doc

Since this is practice, the exercises use the example from the official documentation. The HTML document is as follows:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

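The Beautiful Soup module provides a BeautifulSoup class. Constructing it from a document returns a structured representation of that document.

Parse tree: the data structure Beautiful Soup produces after parsing a document.

A parse object (an instance of BeautifulSoup or BeautifulStoneSoup) is a deeply nested, carefully designed data structure that mirrors the structure of an XML or HTML document. A parse object contains two kinds of objects: Tag objects, which represent tags such as <TITLE> and <B>, and NavigableString objects, which represent strings such as "Page title" and "This is paragraph".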

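With from bs4 import BeautifulSoup, the call t = BeautifulSoup(file, from_encoding="UTF-8") parses an HTML document and returns a handle t. You can specify the document's encoding with from_encoding; if you do not, Beautiful Soup detects the encoding and converts the document to Unicode. To turn a Beautiful Soup document into a string, use str(t) or the prettify method; prettify adds line breaks and indentation so that the document is easier to read. If the original document contains an encoding declaration, Beautiful Soup rewrites that declaration to match the new encoding on output. In other words, after you load an HTML document into BeautifulSoup and write it back out, the HTML has been cleaned up and has also been converted to UTF-8.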
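The round trip described above can be sketched as follows. This is a minimal illustration, not part of the original exercises: the Latin-1 sample document is made up, and the parser is pinned to "html.parser" only so that the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small document stored as Latin-1 (ISO-8859-1) bytes.
raw = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>caf\u00e9</p></body></html>"
).encode("iso-8859-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

paragraph_text = soup.p.string      # text is now a Unicode string
as_text = str(soup)                 # whole document as a str
pretty = soup.prettify()            # same document with line breaks and indentation
utf8_bytes = soup.encode("utf-8")   # document re-encoded as UTF-8 bytes

print(paragraph_text)
print(pretty)
print(utf8_bytes)
```

Inspecting utf8_bytes is a quick way to check how the meta charset declaration was handled on output.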

Exercise 1: return a well-structured HTML document

#!/usr/bin/python
#coding=utf-8
from bs4 import BeautifulSoup
html_doc = 'the HTML document shown above'  # placeholder: paste the HTML here
soup = BeautifulSoup(html_doc)
print(soup.prettify())
 
Result:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

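As you can see, the HTML that was previously run together now has line breaks and indentation and is easier to read. You can use the returned soup handle to access specific parts of the data structure.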

print(soup.title)
> <title>The Dormouse's story</title>

print(soup.title.name)
> title

print(soup.title.string)
> The Dormouse's story

print(soup.p)
> <p class="title"><b>The Dormouse's story</b></p>

print(soup.p["class"])
> ['title']

print(soup.find_all('a'))
> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

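A Tag object corresponds to a tag in the XML or HTML document. Tags have attributes and methods, and both Tag and NavigableString objects have many useful members. Only Tag objects have attributes; NavigableString objects do not. Every tag has a name, which you can read through tag.name and also change. A tag can have several attributes. For example, in <b class="boldest"> the tag b has a class attribute whose value is boldest. You can access a tag's attributes the way you would access a dictionary, or through attrs. Some attributes can hold multiple values, typically attributes such as class and rel; Beautiful Soup treats the value of a multi-valued attribute as a list.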

soup = BeautifulSoup(html_doc)
print(soup.a['href'])
> http://example.com/elsie

print(soup.a.attrs)
> {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

soup = BeautifulSoup('<p class="story dns">dns</p>')
print(soup.p['class'])
> ['story', 'dns']
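The snippets above only read a tag's name and attributes. Changing them could look like the sketch below; the markup and the data-level attribute are invented for illustration, and "html.parser" is chosen only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag["class"] = "verybold"    # overwrite an existing attribute
tag["data-level"] = "1"      # add a new (hypothetical) attribute
del tag["id"]                # remove an attribute

print(tag)
print(tag.attrs)
print(tag.get("id"))         # .get() returns None for a missing attribute
```

Using tag.get("id") instead of tag["id"] avoids a KeyError when the attribute may be absent.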

A NavigableString object holds the text contained in a tag. You can read a tag's text through tag.string, and you can change it with tag.string.replace_with('string...').

soup = BeautifulSoup(html_doc)
print(soup.title.string)
> The Dormouse's story
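Replacing a tag's text with replace_with can be sketched like this. The one-line document is a cut-down version of the example HTML, and the new title text is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

old_text = soup.title.string
kind = type(old_text).__name__       # the text is a NavigableString, not a plain str

# Swap the string inside <title> for a new one (hypothetical replacement text).
soup.title.string.replace_with("A new title")

print(kind)
print(soup.title)
```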

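The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object, which means it supports most of the methods described in Navigating the tree and Searching the tree.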

Navigating the tree:

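A tag may contain other tags and strings. The contained tags are called child tags, and Beautiful Soup provides several ways to access them. Strings do not support these features, because a string cannot have children.

The simplest way to reach a tag in the parse tree is by name, as in soup.tag. You can also chain names to drill down from a larger tag to the tags beneath it; for example, soup.body.b gets the b tag under body. These accessors return only the first match. To find every occurrence of a tag in the whole document, use the find_all() method from Searching the tree.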

print(soup.head)
> <head><title>The Dormouse's story</title></head>
print(soup.head.title)
> <title>The Dormouse's story</title>
print(soup.find_all('a'))
> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The children of a tag are stored in a list called contents; strings have no contents attribute. Besides indexing into that list, you can iterate over a tag's children with children.

soup = BeautifulSoup(html_doc)
first_child = soup.html.body.contents[0]

print(first_child)
> <p class="title"><b>The Dormouse's story</b></p>

for child in soup.html.body.children:
    print(child)
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
> <p class="story">...</p>

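The .contents and .children attributes only consider a tag's direct children. For example, head has a single direct child, the title tag; the string inside title is a child of title, not of head. The .descendants attribute, by contrast, iterates recursively over everything below the tag, so it also includes the string inside title.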
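The difference between direct children and descendants can be checked with a short sketch. The document here is a trimmed version of the example HTML containing only the head.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

direct = head.contents               # list of direct children
via_children = list(head.children)   # same elements, produced by an iterator
everything = list(head.descendants)  # children, grandchildren, and so on

print(len(direct))       # 1: only the <title> tag
print(len(everything))   # 2: the <title> tag and the string inside it
print(everything)
```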

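If a tag has only one child and that child is a NavigableString, you can access it through .string. If a tag's only child is another tag, and that child has a .string, the parent is considered to have the same .string as its child.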

soup = BeautifulSoup(html_doc)
print(soup.title.string)
> The Dormouse's story

print(soup.head.string)
> The Dormouse's story

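If a tag contains more than one string, you can access all of them through strings. Just as you can go from a parent tag down to its children, you can also go back up from a child to its parent. Every tag and every string has a parent: .parent gives the direct parent, and .parents iterates over all of its ancestors.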

soup = BeautifulSoup(html_doc)
print(soup.title.parent)
> <head><title>The Dormouse's story</title></head>
print(soup.title.parent.name)
> head
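The strings and parents accessors mentioned above are not exercised by the snippet, so here is a small sketch of both. The markup is a shortened, made-up variant of the example paragraph; stripped_strings is a related Beautiful Soup generator that skips whitespace-only strings and trims the rest.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">Once '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = p.string                       # None, because <p> has more than one child
texts = list(p.strings)                 # every string under <p>, in document order
trimmed = list(p.stripped_strings)      # the same strings with whitespace stripped

# Walk upward from the first <a> to the top of the tree.
ancestors = [parent.name for parent in soup.a.parents]

print(single)
print(texts)
print(trimmed)
print(ancestors)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.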

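In the HTML example at the start of this document, the second p tag contains three a tags at the same level. Tags at the same level are called siblings. You can use the .next_sibling and .previous_sibling attributes to move forward or backward between elements at the same level.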

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(soup.b.next_sibling)
> <c>text2</c>

print(soup.c.previous_sibling)
> <b>text1</b>

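The b tag has a next_sibling but no previous_sibling, and the c tag has a previous_sibling but no next_sibling. Note that text1 and text2 are not siblings, because they do not share a parent. You can also use .next_siblings and .previous_siblings to iterate over all the siblings that follow or precede a given tag.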
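Iterating over siblings might look like the following sketch. The markup is a shortened form of the example paragraph. One thing it illustrates is that the strings between tags (the comma and the word "and") are siblings too, so a tag's next_sibling is often a string rather than the next tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
last = soup.find(id="link3")

following = list(first.next_siblings)      # strings and tags after link1
preceding = list(last.previous_siblings)   # strings and tags before link3, nearest first

# Keep only the tags among the following siblings.
following_tags = [s["id"] for s in following if isinstance(s, Tag)]

print(repr(first.next_sibling))
print(following)
print(following_tags)
```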

Searching the tree:

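Beautiful Soup defines a number of convenient methods for searching the parse tree; you can use them to filter out the parts of the document you are interested in. These methods accept the following kinds of filter.

string: the simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. soup.find_all('b')

regular expression: you can also pass a compiled regular expression object; Beautiful Soup tests tag names against it using its search() method. soup.find_all(re.compile("t"))

list: you can also pass a list, in which case a tag matches if it matches any element of the list. soup.find_all(["a", "b"])

True: a special value that matches every tag.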
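The four filter types can be compared side by side. The sketch below uses a small made-up document and collects only the tag names so that the results are easy to read.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_string = [t.name for t in soup.find_all("b")]               # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("t"))]    # names containing "t"
by_list = [t.name for t in soup.find_all(["a", "b"])]          # any name in the list
by_true = [t.name for t in soup.find_all(True)]                # every tag

print(by_string)   # ['b']
print(by_regex)    # ['html', 'title']
print(by_list)     # ['b', 'a']
print(by_true)     # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```

Results come back in document order, which is why b precedes a in the list filter.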

The rest of this section looks at one search method, find_all(), which finds every tag in the document that matches the filters.

find_all(name, attrs, recursive, text, limit, **kwargs)

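name: a value passed for name is treated as a tag name. It can be any of the filter types described above.

recursive: when matching under a tag, Beautiful Soup recursively examines all of that tag's descendants. If you only want to match direct children, set recursive=False.

text: pass a value for text to search for strings instead of tags. Although text is meant for searching strings, it can be combined with a tag filter. soup.find_all("a", text="Elsie")

limit: find_all() returns everything that matches. If you only need the first few matches, use limit to cap the number of results.

kwargs: filters on attribute values, passed as keyword arguments; you can pass more than one. soup.find_all(href=re.compile("elsie"), id='link1')

attrs: what if a tag in your document defines an attribute called name? You cannot use name as a keyword argument, because Beautiful Soup already uses name as a parameter. You also cannot use a Python reserved word such as for as a keyword argument. For these cases Beautiful Soup provides the special attrs parameter: a dictionary that works just like the keyword arguments.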
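The parameters above can be tried out together in one sketch. The markup is invented (including the name="email" attribute on the p tag), and the example passes string= instead of text=, since newer Beautiful Soup 4 releases use that name for the same argument.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<html><body><div id="top"><p name="email" class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p></div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# limit: stop after the first match.
first_only = soup.find_all("a", limit=1)

# recursive=False: only direct children of <body>, so the nested links are not found.
direct_links = soup.body.find_all("a", recursive=False)

# Keyword arguments: filter on attribute values.
by_kwargs = soup.find_all(href=re.compile("elsie"), id="link1")

# string (called text in older releases): match on the tag's text.
by_string = soup.find_all("a", string="Elsie")

# attrs: needed here because name= is already a parameter of find_all().
by_attrs = soup.find_all(attrs={"name": "email"})

print(first_only)
print(direct_links)
print(by_kwargs)
print(by_string)
print(by_attrs)
```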

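find(): returns the first matching tag.

find_next_siblings(): iterates over the remaining siblings using .next_siblings and returns all the siblings that match, whereas find_next_sibling() returns only the first match.

find_all_next(): iterates over all the tags and strings that come after the tag using .next_elements and returns every match, whereas find_next() returns only the first match.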
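A short sketch of these methods follows, using a shortened version of the example paragraph plus a made-up closing paragraph. It also shows that find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p><p class="end">...</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                     # first <a> in the document
missing = soup.find("table")               # no match: None

next_one = first.find_next_sibling("a")    # first matching sibling after link1
rest = first.find_next_siblings("a")       # all matching siblings after link1

later_p = first.find_next("p")             # first <p> that starts after link1
links_after = first.find_all_next("a")     # every <a> that comes after link1

print(first["id"], next_one["id"])
print([a["id"] for a in rest])
print(later_p)
print(missing)
```

Note that find_next("p") skips the paragraph that encloses the link, because that paragraph starts before the link in the document.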

Source: oschina

Link: https://my.oschina.net/u/2306127/blog/600139
