bs4

How to find all comments with Beautiful Soup

社会主义新天地 submitted on 2019-11-27 01:53:24
This question was asked four years ago, but the answer is now out of date for BS4. I want to delete all comments in my HTML file using Beautiful Soup. Since BS4 makes each comment a special type of NavigableString, I thought this code would work: for comments in soup.find_all('comment'): comments.decompose() That didn't work, because find_all('comment') looks for tags named comment rather than for comment nodes. How do I find all comments using BS4?
Flickerlight: You can pass a function to find_all() to help it check whether a string is a Comment. For example, given the HTML below: <body> <!-- Branding and main navigation --> <div class="Branding">The Science & Safety
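A minimal sketch of the approach described in the answer, assuming bs4 4.4 or later (where find_all() accepts a `string` argument). The sample markup and the `strip_comments` helper name are hypothetical, and `extract()` is used here in place of the `decompose()` call from the question:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, loosely modelled on the snippet in the answer.
SAMPLE_HTML = """<body>
<!-- Branding and main navigation -->
<div class="Branding">Example text</div>
<!-- Footer -->
</body>"""


def strip_comments(html):
    """Return (comment_texts, html_without_comments)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() calls the function on each string node; Comment is a
    # NavigableString subclass, so isinstance() selects comments only.
    comments = soup.find_all(string=lambda node: isinstance(node, Comment))
    texts = [c.strip() for c in comments]
    for c in comments:
        c.extract()  # detach the comment node from the tree
    return texts, str(soup)


texts, cleaned = strip_comments(SAMPLE_HTML)
print(texts)
print(cleaned)
```

The list of comments is collected before any node is removed, so the tree is not modified while it is being searched.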

Installing and Using BeautifulSoup

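流过昼夜 submitted on 2019-11-26 18:16:17
Installing and using BeautifulSoup. BeautifulSoup is a great tool. Official site: http://www.crummy.com/software/BeautifulSoup/ Download: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/ (the 4.1.2 source package is attached). Documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html , a Chinese translation, but it is somewhat dated: it covers version 3.0 and is not very useful. I recommend reading http://www.crummy.com/software/BeautifulSoup/bs4/doc/ instead; this is the official English documentation, and it is far more pleasant and clearer to read. If you want to read web pages in Python in a jQuery-like style, without being troubled by malformed page markup and tags, BeautifulSoup is a good choice. According to the official site, BeautifulSoup is intended for screen-scraping projects and is designed to save you the work of processing HTML tags and extracting information from the web. BeautifulSoup has several advantages that make it especially powerful: 1 : Beautiful Soup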

Extract `src` attribute from `img` tag using BeautifulSoup

半城伤御伤魂 submitted on 2019-11-26 17:05:14
Question: <div class="someClass"> <a href="href"> <img alt="some" src="some"/> </a> </div> I use bs4 and I cannot use a.attrs['src'] to get the src , but I can get href . What should I do? Answer 1: You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but the same approach works for a URL together with urllib2 . For URLs: from BeautifulSoup import BeautifulSoup as BSHTML import urllib2 page = urllib2.urlopen('http://www.youtube.com/') soup =
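A short sketch of why a.attrs['src'] fails and one way around it, assuming bs4 on Python 3 (the answer above is written against the older BeautifulSoup 3 package and urllib2). In the markup from the question, src is an attribute of the nested img tag, not of the a tag, so it has to be read from the img element. The variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Markup taken from the question.
html = ('<div class="someClass"> <a href="href"> '
        '<img alt="some" src="some"/> </a> </div>')

soup = BeautifulSoup(html, "html.parser")

a = soup.find("a")
href = a["href"]            # href is an attribute of <a>, so this works
has_src = "src" in a.attrs  # False: <a> itself has no src attribute

img = a.find("img")         # descend to the nested <img>
src = img.get("src")        # .get() returns None if the attribute is absent

# Collect src from every <img> in the document that actually has one.
all_srcs = [tag["src"] for tag in soup.find_all("img", src=True)]

print(href, src, all_srcs)
```

Using `.get("src")` rather than `["src"]` avoids a KeyError on img tags that lack the attribute; the `src=True` filter achieves the same for the whole-document search.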