beautifulsoup

HTML parsing methods for scraped data

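半城伤御伤魂 submitted on 2020-11-10 18:40:10
A crawler that requests a URL usually gets HTML back, and that document then has to be parsed quickly so the target elements can be located and their data extracted. Beautiful Soup is a Python library for pulling data out of HTML or XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can greatly speed up document analysis and cut development time. The following walks through all the major features of BeautifulSoup4: what it is suited for, how it works and how to use it, how to get the result you want, and how to handle exceptional cases.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com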
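As a minimal sketch of the navigation and search described in the excerpt above: the snippet below assumes a shortened stand-in document (`sample_html` is hypothetical, not the excerpt's full `html_doc`) and Python's built-in `html.parser`. It reads the title, collects link targets, and shows that a failed `find` returns `None` instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the excerpt's html_doc.
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
</body></html>
"""

# Parse with the standard-library parser (no extra install needed).
soup = BeautifulSoup(sample_html, "html.parser")

# Navigate: dotted access returns the first matching tag.
title_text = soup.title.string

# Search: find_all returns every match; class_ filters on the CSS class.
sister_links = [a["href"] for a in soup.find_all("a", class_="sister")]

# Handle the missing case: find() yields None when nothing matches.
missing = soup.find("table")
if missing is None:
    print("no <table> in this document")

print(title_text)
print(sister_links)
```

Checking for `None` before chaining further calls is one common way to avoid an `AttributeError` when a page's layout differs from what the scraper expects.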

Web scraping: Autohome (requests)

蹲街弑〆低调 submitted on 2020-11-08 18:25:01
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# requests: downloads the page source, the counterpart of urlopen()
# Beautiful Soup: parses the HTML, replacing the regex (re) part
# Html
# BeautifulSoup().find("a")
import requests
import bs4
from bs4 import BeautifulSoup

# Fetch the source of the Autohome home page
# (the counterpart of urlopen(url).read().decode())
main_page_content = requests.get("https://www.autohome.com.cn/weifang/").text

# Hand the page source to bs4 for parsing
main_page = BeautifulSoup(main_page_content, "html.parser")

# Now tags can be located
main_div = main_page.find(name="div", attrs={"class": "people-content"})
main_ul = main_div.find(name="ul", attrs={"class": "list-text"})
main_a_lst = main_ul.find_all(
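The chained `find` calls in the excerpt above can be tried offline. The sketch below is an illustration under stated assumptions, not the author's code: `sample_page` is made-up markup that only reuses the class names `people-content` and `list-text` from the excerpt, and `extract_links` is a hypothetical helper. It adds `None` checks, since `find` returns `None` when the site's markup changes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the two class names come from the excerpt.
sample_page = """
<div class="people-content">
  <ul class="list-text">
    <li><a href="/a/1">First</a></li>
    <li><a href="/a/2">Second</a></li>
  </ul>
</div>
"""


def extract_links(html):
    """Return (text, href) pairs from the list inside the target div."""
    page = BeautifulSoup(html, "html.parser")

    # Locate the outer container; bail out if the layout is different.
    div = page.find(name="div", attrs={"class": "people-content"})
    if div is None:
        return []

    # Narrow the search to the list inside that container.
    ul = div.find(name="ul", attrs={"class": "list-text"})
    if ul is None:
        return []

    # Collect every anchor's visible text and href attribute.
    return [(a.get_text(strip=True), a.get("href")) for a in ul.find_all("a")]


links = extract_links(sample_page)
print(links)
```

With a live page, the string passed to `extract_links` would be the `.text` of a `requests.get(...)` response, as in the excerpt.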