beautifulsoup

Filtering out HTML elements which have 'display:none' either as a tag attribute or in their CSS

拜拜、爱过 submitted on 2020-01-11 06:27:04
Question: Let's say you have some HTML source that's been scraped with Selenium and parsed with BeautifulSoup:

    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()
    driver.get(url)
    soup = BeautifulSoup(driver.page_source)

Is there a way to remove, from the HTML code or the soup object, all elements which either: 1.) have the attribute style=display:none within the HTML tag source (i.e. <div style = 'display:none'>...</div>) or 2.) have the display:none property
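
For the first case in the question (an inline style attribute), one possible approach is sketched below. It is not taken from the original thread: the helper name `strip_inline_hidden`, the regular expression, and the sample HTML are all illustrative assumptions. It does not cover the second case, because BeautifulSoup only sees the markup and does not evaluate stylesheets.

```python
import re
from bs4 import BeautifulSoup

# Matches "display:none" with optional whitespace, case-insensitively.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def strip_inline_hidden(html):
    """Return a soup with every element hidden by an inline style removed.

    Hypothetical helper: only inline style attributes are inspected.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Re-query after each removal, so elements already destroyed together
    # with a hidden ancestor are never touched a second time.
    hidden = soup.find(style=HIDDEN_STYLE)
    while hidden is not None:
        hidden.decompose()
        hidden = soup.find(style=HIDDEN_STYLE)
    return soup


# Invented sample markup, standing in for driver.page_source.
sample_html = (
    "<div><p>visible</p>"
    "<div style='display: none'><p>hidden</p>"
    "<span style='display:none'>nested</span></div>"
    "<span style='color:red'>red</span></div>"
)
cleaned = strip_inline_hidden(sample_html)
print(cleaned.get_text(" ", strip=True))
```

Elements hidden through a class or an external stylesheet would need the computed style, which is only available from the browser side (for example through Selenium), not from the soup.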

Beautifulsoup not returning complete HTML of the page

邮差的信 submitted on 2020-01-11 04:19:05
Question: I have been digging through the site for some time and I'm unable to find a solution to my issue. I'm fairly new to web scraping and am trying to simply extract some links from a web page using Beautiful Soup.

    url = "https://www.sofascore.com/pt/futebol/2018-09-18"
    page = urlopen(url).read()
    soup = BeautifulSoup(page, "lxml")
    print(soup)

At the most basic level, all I'm trying to do is access a specific tag within the website. I can work out the rest for myself, but the part I'm struggling with is the
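
The excerpt ends before any answer is given. One common reason a tag seems to be missing, offered here only as an assumption about this particular page, is that the content is inserted by JavaScript after the page loads: `urlopen` returns just the initial markup, and BeautifulSoup does not execute scripts. The invented HTML below illustrates the effect.

```python
from bs4 import BeautifulSoup

# Invented markup: the link exists only inside a script, as a string that a
# browser would insert into the page at runtime.
static_html = """
<html><body>
  <div id="matches"></div>
  <script>
    document.getElementById("matches").innerHTML =
      '<a href="/match/1">Match 1</a>';
  </script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
links = soup.find_all("a")
container = soup.find("div", id="matches")

# The parser treats the script body as plain text, so no <a> tag is found
# and the container is still empty.
print(len(links))
print(repr(container.get_text(strip=True)))
```

If that is what is happening, the usual options are to render the page in a real browser first (for example with Selenium) or to request the underlying data endpoint directly.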

Beautiful Soup and extracting a div and its contents by ID

北慕城南 submitted on 2020-01-09 05:54:58
Question:

    soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and the content in between? It returns nothing, and I know for a fact the element exists because I'm staring right at it in the output of soup.prettify().

    soup.find("div", { "id" : "articlebody" })

also does not work. Edit: There is no answer to this post - how do I delete it? I found that BeautifulSoup is not parsing correctly, which probably actually means the page I'm trying to parse isn't properly
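
As a point of comparison, here is a minimal sketch on well-formed, invented markup. There the `div` form from the question does find the element, which is consistent with the poster's own conclusion that the page or the parser, not the call, was at fault. The `"html.parser"` choice is an assumption; a different parser may cope better with broken markup.

```python
from bs4 import BeautifulSoup

# Invented, well-formed markup containing the element from the question.
html = '<html><body><div id="articlebody"><p>Story text</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionary, as written in the question.
by_dict = soup.find("div", {"id": "articlebody"})
# Keyword form, which does not require knowing the tag name.
by_keyword = soup.find(id="articlebody")
# A literal "tagName" matches nothing, because no <tagName> element exists.
missing = soup.find("tagName", {"id": "articlebody"})

print(by_dict)
print(by_keyword is by_dict)
print(missing)
```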

Why do I get a recursion error with BeautifulSoup and IDLE?

那年仲夏 submitted on 2020-01-08 18:04:47
Question: I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to remove names from the URLs on an HTML page I downloaded. I have it working great to this point:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(open("43rd-congress.html"))
    final_link = soup.p.a
    final_link.decompose()
    links = soup.find_all('a')
    for link in links:
        print link

but when I enter this next part

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(open("43rd-congress.html"))
    final_link = soup.p.a
    final
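
The part that the question reports as working can be tried without the downloaded file. The sketch below is a Python 3 variant on invented markup (the question itself uses the Python 2 `print` statement and the file 43rd-congress.html); it does not attempt to reproduce the recursion error, whose cause is not visible in this excerpt.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the downloaded page.
html = """
<p><a href="/members/a">Member A</a></p>
<p><a href="/members/b">Member B</a></p>
<p><a href="/members/c">Member C</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.p.a is the first <a> inside the first <p>; decompose() removes it
# from the tree and destroys it.
final_link = soup.p.a
final_link.decompose()

# Only the links that were not removed remain.
links = soup.find_all("a")
for link in links:
    print(link)
```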

Why do I get a recursion error with BeautifulSoup and IDLE?

限于喜欢 submitted on 2020-01-08 18:02:58
Question: I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to remove names from the URLs on an HTML page I downloaded. I have it working great to this point:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(open("43rd-congress.html"))
    final_link = soup.p.a
    final_link.decompose()
    links = soup.find_all('a')
    for link in links:
        print link

but when I enter this next part

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(open("43rd-congress.html"))
    final_link = soup.p.a
    final

Twenty Years of Programming Language Rivalry: Which One Is Your Idol?

别说谁变了你拦得住时间么 submitted on 2020-01-08 16:13:24
Introduction: It is 2020, and time to pick the best programming language of 2019. Which will it be? Going by TIOBE, Java, C, and Python have essentially locked in the top three places, and Java's position as the leader remains unshaken. First, a trend chart from the TIOBE site to open the post.

Data acquisition: This part is very similar to the previous post on databases: parse the variables in the page's JavaScript code and extract the data.

    def get_pl_data(name):
        name_lower = [i.lower() for i in name]
        for i in name_lower:
            print("Request ", i)
            if i == 'c#':
                i = 'csharp'
            url = 'https://www.tiobe.com/tiobe-index/' + i
            res = requests.get(url).text
            content = BeautifulSoup(res, "html.parser")
            js = content.find_all('script')[9].string
            src_text = js2xml.parse(js)
            src_tree = js2xml.pretty

Difference between .string and .text BeautifulSoup

久未见 submitted on 2020-01-08 13:16:31
Question: I noticed something odd when working with BeautifulSoup and couldn't find any documentation to support this, so I wanted to ask over here. Say we have tags like these that we have parsed with BS: <td>Some Table Data</td> <td></td> The official documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted. However, I couldn't find any reference to
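
The behaviour described in the question can be reproduced with a short sketch. The third cell, with mixed content, is an added assumption used to show a second case where the two attributes differ; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr>"
    "<td>Some Table Data</td>"
    "<td></td>"
    "<td>Mixed <b>content</b></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
filled, empty, mixed = soup.find_all("td")

# .string returns the single child string, or None when the tag has no
# children or more than one child.
print(repr(filled.string), repr(empty.string), repr(mixed.string))

# .text (a shortcut for get_text()) joins every string beneath the tag,
# which gives an empty string for an empty tag.
print(repr(filled.text), repr(empty.text), repr(mixed.text))
```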

Difference between .string and .text BeautifulSoup

谁说胖子不能爱 submitted on 2020-01-08 13:12:08
Question: I noticed something odd when working with BeautifulSoup and couldn't find any documentation to support this, so I wanted to ask over here. Say we have tags like these that we have parsed with BS: <td>Some Table Data</td> <td></td> The official documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted. However, I couldn't find any reference to

Python Web Scraping Basics: Beautiful Soup Usage Explained

自作多情 submitted on 2020-01-07 10:34:31
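Preface: When it comes to web scraping, Beautiful Soup is hard to leave out. Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. That is how the official documentation describes it, and the description is clear enough: it is a data extraction library.

Here is a walkthrough of Beautiful Soup in use. First, look at the HTML tags of the target site. The article titles are clearly all inside a tags, so find_all('a', 'title') can be used to extract them. The original post shows the code and its output at this point.

For more ways to use the library, see the Beautiful Soup documentation.

Source: oschina
Link: https://my.oschina.net/u/4104998/blog/3043690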
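
The post's own code is not included in this excerpt. As a rough sketch of the call it mentions: when the second positional argument of `find_all` is a string, Beautiful Soup treats it as a CSS class, so `find_all('a', 'title')` selects `a` tags whose class is `title`. The markup and variable names below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup: article titles are <a class="title">, other links are not.
html = """
<div class="post"><a class="title" href="/post/1">First post</a>
  <a class="more" href="/post/1">Read more</a></div>
<div class="post"><a class="title" href="/post/2">Second post</a>
  <a class="more" href="/post/2">Read more</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A string as the second argument filters on the class attribute.
title_links = soup.find_all("a", "title")
titles = [a.get_text(strip=True) for a in title_links]
print(titles)
```

The same filter can be written more explicitly as `soup.find_all("a", class_="title")`.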

Find Out in One Article Whether Learning Python Is Right for You

扶醉桌前 submitted on 2020-01-07 10:02:25
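Programming is not easy for any beginner, especially in China, where C is usually the first language taught. Python is a real boon for anyone who wants to learn to program: reading Python code is like reading an article, thanks to the language's elegant syntax, and it is often called one of the most elegant languages. If you belong to one of the following groups, learning Python is strongly recommended!

Programming beginners: You love programming and want to work in the field, but you are starting from zero and don't know which language to begin with. Python is the best fit for you.

Front-end web developers: You usually focus only on page technologies such as div+css, yet you often need to work with back-end developers.

SEO practitioners: When doing SEO work, many people are held back because they cannot program. Program-level problems go unsolved, and only simple page optimization is possible. Once you learn Python, you can, as I do, write programs that check indexing and rankings and generate sitemaps automatically, solving thorny SEO problems.

Students: If you want a marketable skill, or you are a self-taught enthusiast who wants to get started quickly and avoid detours, Python is a good choice.

Learning Python can be divided into several stages:

Step 1: Basics. This is simple: set up the environment, then follow this site and type through the examples to cover the basics once. It does not take long, roughly 1~2 weeks. Focus on the beginner tutorial, plus regular expressions, MySQL, and multithreading from the advanced tutorial.

Step 2: Consolidation. Find simple practice projects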