beautifulsoup

Fast and effective way to parse broken HTML?

Submitted by 空扰寡人 on 2019-12-21 06:08:08
Question: I'm working on large projects which require fast HTML parsing, including recovery for broken HTML pages. Currently lxml is my choice; I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces noticeably better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the …
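A minimal way to compare parsers on malformed markup is shown below. The snippet is invented (an unclosed tag inside a `<header>` block, the same kind of breakage described), not the linked page:

```python
from bs4 import BeautifulSoup

# Invented snippet with a broken <header> block: the <h1> is never closed
# before </header>.
broken = "<html><body><header><h1>Title</header><p>Body text</p></body></html>"

# BeautifulSoup repairs the tree; swap "html.parser" for "lxml" or "html5lib"
# to compare how each parser recovers (html.parser needs no extra install).
soup = BeautifulSoup(broken, "html.parser")
heading = soup.h1.get_text()
paragraph = soup.p.get_text()
```

In practice `html5lib` recovers closest to what a browser would build, at the cost of speed, which is the usual trade-off behind this question.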

Finding partial matches in an href tag

Submitted by ℡╲_俬逩灬. on 2019-12-21 06:05:02
Question: I am trying to use Beautiful Soup to find all <a> elements where the href attribute includes a certain string. An example of the full element is: <a href="/markets/NZSX/securities/ABA">ABA</a>. I am looking for all elements where href includes "/markets/NZSX/securities/", and I want to extract the text from each such element; that would be ABA in the example. Answer 1: There are several ways to achieve that. With .find_all(): soup.find_all("a", href=re.compile(r"^/markets/NZSX/securities/")) soup…
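A self-contained version of the `.find_all()` approach from the answer, run against a couple of invented anchors:

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a href="/markets/NZSX/securities/ABA">ABA</a>'
    '<a href="/markets/NZSX/securities/AIR">AIR</a>'
    '<a href="/news/today">News</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors whose href starts with the wanted prefix,
# then pull out each link's text.
links = soup.find_all("a", href=re.compile(r"^/markets/NZSX/securities/"))
codes = [a.get_text() for a in links]
```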

HTML and BeautifulSoup: how to iteratively parse when the structure is not always known beforehand?

Submitted by て烟熏妆下的殇ゞ on 2019-12-21 05:31:40
Question: I began with a simple HTML structure, something like this: Thanks to the help of @alecxe, I was able to create this JSON dict: {u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}} using his code: from bs4 import BeautifulSoup data = """your html goes here: see the very end of post""" soup = BeautifulSoup(data) inner_ul = soup.find('ul', class_='innerUl') inner_items = [li.text.strip() for li in inner_ul.ul.find_all('li')] outer_ul_text = soup.ul.span.text.strip() inner_ul_text…
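Filling in the truncated code, here is a runnable sketch against a guessed version of the HTML (an outer `<ul>` with a `<span>` label and a nested `ul.innerUl`, matching the JSON shape above):

```python
from bs4 import BeautifulSoup

# Guessed HTML shape matching the JSON dict in the question.
data = """
<ul><span>Outer List</span>
  <ul class="innerUl"><span>Inner List</span>
    <li>info 1</li><li>info 2</li><li>info 3</li>
  </ul>
</ul>
"""
soup = BeautifulSoup(data, "html.parser")

outer_label = soup.ul.span.get_text(strip=True)
inner_ul = soup.find("ul", class_="innerUl")
inner_label = inner_ul.span.get_text(strip=True)
items = [li.get_text(strip=True) for li in inner_ul.find_all("li")]

result = {outer_label: {inner_label: items}}
```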

python SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")

Submitted by 纵饮孤独 on 2019-12-21 05:13:22
Question: I was scraping this aspx website: https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2. As it requires, I have to pass __VIEWSTATE and __EVENTVALIDATION while sending a POST request. So I am trying to send a GET request first to obtain those two values, and then parse them afterward. However, every time I send the GET request it throws this error message: requests.exceptions.SSLError: HTTPSConnectionPool(host='gra206.aca.ntu.edu.tw', port=443): Max retries…
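One commonly suggested workaround (not verified against this particular server) is to mount a `requests` adapter with a relaxed SSL context, since "Unexpected EOF" handshake failures often come from servers that only negotiate legacy TLS versions or ciphers:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter

class LegacyTLSAdapter(HTTPAdapter):
    """Adapter with a relaxed SSL context, a common workaround for
    'bad handshake / Unexpected EOF' against legacy servers."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        try:
            # Lower OpenSSL's security level so older ciphers are offered.
            ctx.set_ciphers("DEFAULT:@SECLEVEL=1")
        except ssl.SSLError:
            pass  # some OpenSSL builds reject the SECLEVEL syntax
        kwargs["ssl_context"] = ctx
        super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", LegacyTLSAdapter())
# resp = session.get("https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2")
# __VIEWSTATE / __EVENTVALIDATION would then be parsed out of resp.text.
```

The actual GET is left commented out since it needs network access; whether the relaxed context fixes this exact host is an assumption.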

Scraping text in h3 and div tags using beautifulSoup, Python

Submitted by 早过忘川 on 2019-12-21 05:01:05
Question: I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data): <div class="box effect"> <div class="row"> <div class="col-lg-10"> <h3>HEADING</h3> <div><i class="fa user"></i> NAME</div> <div><i class="fa phone"></i> MOBILE</div> <div><i class="fa mobile-phone fa-2"></i> NUMBER</div> <div><i class="fa address"></i> XYZ_ADDRESS</div> <div class=…
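A hedged sketch for one such row, using the classes from the snippet above (the real page may differ), collecting the values in CSV order:

```python
from bs4 import BeautifulSoup

html = """
<div class="box effect"><div class="row"><div class="col-lg-10">
  <h3>HEADING</h3>
  <div><i class="fa user"></i> NAME</div>
  <div><i class="fa phone"></i> MOBILE</div>
  <div><i class="fa address"></i> XYZ_ADDRESS</div>
</div></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for box in soup.find_all("div", class_="box"):
    heading = box.h3.get_text(strip=True)
    # Direct <div> children of col-lg-10 hold one field each; the <i> icon
    # contributes no text, so get_text() leaves just the value.
    fields = [d.get_text(strip=True) for d in box.select("div.col-lg-10 > div")]
    rows.append([heading] + fields)
```

Writing `rows` out with `csv.writer(...).writerows(rows)` then yields one CSV line per box.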

How to select div by text content using Beautiful Soup?

Submitted by 痞子三分冷 on 2019-12-21 04:35:18
Question: Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc. Imagine everyone takes 3-5 classes; one of them is always Biology, and the report card is always alphabetized. I want everybody's Biology grade. I've already scraped all this HTML into text; now how do I fish out the Biology grades? <div class="student"> <div class="score">Algebra C-</div> <div class="score">Biology A+</div> <div class="score">Chemistry B</div> </div> <div…
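One way to fish those out, shown against the sample markup above: filter every `div.score` on its text and keep the trailing grade token.

```python
from bs4 import BeautifulSoup

html = """
<div class="student">
  <div class="score">Algebra C-</div>
  <div class="score">Biology A+</div>
  <div class="score">Chemistry B</div>
</div>
<div class="student">
  <div class="score">Art A</div>
  <div class="score">Biology B-</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

grades = [
    div.get_text(strip=True).split()[-1]          # last token is the grade
    for div in soup.find_all("div", class_="score")
    if div.get_text(strip=True).startswith("Biology")
]
```

This sidesteps the positional div[0]/div[1] problem entirely by matching on the text rather than the index.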

BeautifulSoup - TypeError: 'NoneType' object is not callable

Submitted by 南楼画角 on 2019-12-21 03:43:19
Question: I need to make my code backwards compatible with Python 2.6 and BeautifulSoup 3. My code was written using Python 2.7, and in that case using BS4. But when I try to run it on the Squeeze server (it has Python 2.6 and BS3), I get this error: try: from bs4 import BeautifulSoup except ImportError: from BeautifulSoup import BeautifulSoup gmp = open(fname, 'r') soup = BeautifulSoup(gmp) p = soup.body.div.find_all('p') which raises: TypeError: 'NoneType' object is not callable If I change…
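A sketch of one backwards-compatible spelling: BS3 only defines camelCase `findAll`, and BS4 keeps it as an alias of `find_all`, so camelCase works under both (the example runs under BS4; BS3 behaviour is assumed from its documentation):

```python
try:
    from bs4 import BeautifulSoup            # BS4 on Python 2.7+/3.x
except ImportError:
    from BeautifulSoup import BeautifulSoup  # BS3 on the Python 2.6 box

soup = BeautifulSoup("<body><div><p>one</p><p>two</p></div></body>")
# find_all does not exist in BS3: BS3 treats unknown attribute access as a
# tag lookup that returns None, hence "'NoneType' object is not callable"
# when it is then called. The camelCase form is defined in both versions.
texts = [p.string for p in soup.body.div.findAll("p")]
```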

Get the structure of HTML code

Submitted by 对着背影说爱祢 on 2019-12-21 02:42:22
Question: I'm using BeautifulSoup4 and I'm curious whether there is a function which returns the structure (ordered tags) of the HTML code. Here is an example: <html> <body> <h1>Simple example</h1> <p>This is a simple example of html page</p> </body> </html> print page.structure(): >> <html> <body> <h1></h1> <p></p> </body> </html> I tried to find a solution but had no success. Thanks. Answer 1: There is not, to my knowledge, but a little recursion should work: def taggify(soup): for tag in soup: if isinstance…
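The truncated answer can be completed along these lines (a guessed reconstruction of the recursion, not the answerer's exact code):

```python
from bs4 import BeautifulSoup, Tag

def structure(node, depth=0):
    """Return the tag skeleton of `node`, dropping all text and attributes."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):               # skip NavigableString text nodes
            pad = "  " * depth
            lines.append(pad + "<%s>" % child.name)
            lines.extend(structure(child, depth + 1))
            lines.append(pad + "</%s>" % child.name)
    return lines

page = BeautifulSoup(
    "<html><body><h1>Simple example</h1>"
    "<p>This is a simple example of html page</p></body></html>",
    "html.parser",
)
skeleton = "\n".join(structure(page))
```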

Batch downloading text and images from URL with Python / urllib / beautifulsoup?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-21 02:39:07
Question: I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python. import urllib, urllib2 import urlparse from BeautifulSoup import BeautifulSoup import os, sys def getAllImages(url): query = urllib2.Request(url) user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" query.add_header("User-Agent", user_agent) page = BeautifulSoup(urllib2.urlopen(query)) for div in…
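A modernized sketch of the same idea (Python 3 `urllib.request` instead of `urllib2`; the page and hostnames below are invented, and the actual network call is left commented out):

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def image_urls(page_html, base_url):
    """Collect an absolute URL for every <img src=...> on the page."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

def download(url, path, user_agent="Mozilla/5.0"):
    """Fetch one resource with a browser-like User-Agent and save it to disk."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())

# Invented sample page; in practice page would come from a first download().
page = '<img src="/pics/a.jpg"><img src="http://cdn.example.com/b.png">'
urls = image_urls(page, "http://example.com/gallery")
# for i, u in enumerate(urls):
#     download(u, "img_%03d" % i)
```

`urljoin` handles both relative and absolute `src` values, which is the part that usually trips up batch downloaders.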

Access next sibling <li> element with BeautifulSoup

Submitted by 本秂侑毒 on 2019-12-20 23:22:09
Question: I am completely new to web parsing with Python/BeautifulSoup. I have an HTML file that has (in part) the following code: <div id="pages"> <ul> <li class="active"><a href="example.com">Example</a></li> <li><a href="example.com">Example</a></li> <li><a href="example1.com">Example 1</a></li> <li><a href="example2.com">Example 2</a></li> </ul> </div> I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li>…
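Sticking with the markup above, `find_next_sibling("li")` walks from the active item to the next `<li>`, skipping the whitespace text nodes that plain `.next_sibling` would return:

```python
from bs4 import BeautifulSoup

html = """
<div id="pages"><ul>
  <li class="active"><a href="example.com">Example</a></li>
  <li><a href="example1.com">Example 1</a></li>
  <li><a href="example2.com">Example 2</a></li>
</ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

item = soup.find("li", class_="active")
remaining = []
while True:
    item = item.find_next_sibling("li")   # skips whitespace between <li> tags
    if item is None:                      # no more <li> siblings
        break
    remaining.append(item.a["href"])
```

The loop mirrors "visit each link until there are no more `<li>` tags"; fetching each href on every iteration would be the next step.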