beautifulsoup

Fast and effective way to parse broken HTML?

Submitted by 空扰寡人 on 2019-12-21 06:08:08
Question: I'm working on large projects which require fast HTML parsing, including recovery for broken HTML pages. Currently lxml is my choice; I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces noticeably better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the …
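A minimal way to compare parsers on malformed markup is shown below. The snippet is invented (an unclosed tag inside a `<header>` block, the same kind of breakage described), not the linked page:

```python
from bs4 import BeautifulSoup

# Invented snippet with a broken <header> block: the <h1> is never closed
# before </header>.
broken = "<html><body><header><h1>Title</header><p>Body text</p></body></html>"

# BeautifulSoup repairs the tree; swap "html.parser" for "lxml" or "html5lib"
# to compare how each parser recovers (html.parser needs no extra install).
soup = BeautifulSoup(broken, "html.parser")
heading = soup.h1.get_text()
paragraph = soup.p.get_text()
```

In practice `html5lib` recovers closest to what a browser would build, at the cost of speed, which is the usual trade-off behind this question.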

Finding partial matches in an href tag

Submitted by ℡╲_俬逩灬. on 2019-12-21 06:05:02
Question: I am trying to use Beautiful Soup to find all <a> elements where the href attribute includes a certain string. An example of the full element is: <a href="/markets/NZSX/securities/ABA">ABA</a>. I am looking for all elements where href includes "/markets/NZSX/securities/", and I want to extract the text from each such element; that would be ABA in the example. Answer 1: There are several ways to achieve that. With .find_all(): soup.find_all("a", href=re.compile(r"^/markets/NZSX/securities/")) soup…
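A self-contained version of the `.find_all()` approach from the answer, run against a couple of invented anchors:

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a href="/markets/NZSX/securities/ABA">ABA</a>'
    '<a href="/markets/NZSX/securities/AIR">AIR</a>'
    '<a href="/news/today">News</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors whose href starts with the wanted prefix,
# then pull out each link's text.
links = soup.find_all("a", href=re.compile(r"^/markets/NZSX/securities/"))
codes = [a.get_text() for a in links]
```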

HTML and BeautifulSoup: how to iteratively parse when the structure is not always known beforehand?

Submitted by て烟熏妆下的殇ゞ on 2019-12-21 05:31:40
Question: I began with a simple HTML structure, something like this: Thanks to the help of @alecxe, I was able to create this JSON dict: {u'Outer List': {u'Inner List': [u'info 1', u'info 2', u'info 3']}} using his code: from bs4 import BeautifulSoup data = """your html goes here: see the very end of post""" soup = BeautifulSoup(data) inner_ul = soup.find('ul', class_='innerUl') inner_items = [li.text.strip() for li in inner_ul.ul.find_all('li')] outer_ul_text = soup.ul.span.text.strip() inner_ul_text…
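Filling in the truncated code, here is a runnable sketch against a guessed version of the HTML (an outer `<ul>` with a `<span>` label and a nested `ul.innerUl`, matching the JSON shape above):

```python
from bs4 import BeautifulSoup

# Guessed HTML shape matching the JSON dict in the question.
data = """
<ul><span>Outer List</span>
  <ul class="innerUl"><span>Inner List</span>
    <li>info 1</li><li>info 2</li><li>info 3</li>
  </ul>
</ul>
"""
soup = BeautifulSoup(data, "html.parser")

outer_label = soup.ul.span.get_text(strip=True)
inner_ul = soup.find("ul", class_="innerUl")
inner_label = inner_ul.span.get_text(strip=True)
items = [li.get_text(strip=True) for li in inner_ul.find_all("li")]

result = {outer_label: {inner_label: items}}
```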

python SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")

Submitted by 纵饮孤独 on 2019-12-21 05:13:22
Question: I was scraping this aspx website: https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2. As it requires, I have to pass __VIEWSTATE and __EVENTVALIDATION while sending a POST request. So I am trying to send a GET request first to obtain those two values, and then parse them afterward. However, every time I send the GET request it throws this error message: requests.exceptions.SSLError: HTTPSConnectionPool(host='gra206.aca.ntu.edu.tw', port=443): Max retries…
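One commonly suggested workaround (not verified against this particular server) is to mount a `requests` adapter with a relaxed SSL context, since "Unexpected EOF" handshake failures often come from servers that only negotiate legacy TLS versions or ciphers:

```python
import ssl
import requests
from requests.adapters import HTTPAdapter

class LegacyTLSAdapter(HTTPAdapter):
    """Adapter with a relaxed SSL context, a common workaround for
    'bad handshake / Unexpected EOF' against legacy servers."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        try:
            # Lower OpenSSL's security level so older ciphers are offered.
            ctx.set_ciphers("DEFAULT:@SECLEVEL=1")
        except ssl.SSLError:
            pass  # some OpenSSL builds reject the SECLEVEL syntax
        kwargs["ssl_context"] = ctx
        super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount("https://", LegacyTLSAdapter())
# resp = session.get("https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2")
# __VIEWSTATE / __EVENTVALIDATION would then be parsed out of resp.text.
```

The actual GET is left commented out since it needs network access; whether the relaxed context fixes this exact host is an assumption.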

Scraping text in h3 and div tags using beautifulSoup, Python

Submitted by 早过忘川 on 2019-12-21 05:01:05
Question: I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data): <div class="box effect"> <div class="row"> <div class="col-lg-10"> <h3>HEADING</h3> <div><i class="fa user"></i> NAME</div> <div><i class="fa phone"></i> MOBILE</div> <div><i class="fa mobile-phone fa-2"></i> NUMBER</div> <div><i class="fa address"></i> XYZ_ADDRESS</div> <div class=…
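A hedged sketch for one such row, using the classes from the snippet above (the real page may differ), collecting the values in CSV order:

```python
from bs4 import BeautifulSoup

html = """
<div class="box effect"><div class="row"><div class="col-lg-10">
  <h3>HEADING</h3>
  <div><i class="fa user"></i> NAME</div>
  <div><i class="fa phone"></i> MOBILE</div>
  <div><i class="fa address"></i> XYZ_ADDRESS</div>
</div></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for box in soup.find_all("div", class_="box"):
    heading = box.h3.get_text(strip=True)
    # Direct <div> children of col-lg-10 hold one field each; the <i> icon
    # contributes no text, so get_text() leaves just the value.
    fields = [d.get_text(strip=True) for d in box.select("div.col-lg-10 > div")]
    rows.append([heading] + fields)
```

Writing `rows` out with `csv.writer(...).writerows(rows)` then yields one CSV line per box.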

How to select div by text content using Beautiful Soup?

Submitted by 痞子三分冷 on 2019-12-21 04:35:18
Question: Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc. Imagine everyone takes 3-5 classes; one of them is always Biology, and the report card is always alphabetized. I want everybody's Biology grade. I've already scraped all this HTML into text; now how do I fish out the Biology grades? <div class="student"> <div class="score">Algebra C-</div> <div class="score">Biology A+</div> <div class="score">Chemistry B</div> </div> <div…
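One way to fish those out, shown against the sample markup above: filter every `div.score` on its text and keep the trailing grade token.

```python
from bs4 import BeautifulSoup

html = """
<div class="student">
  <div class="score">Algebra C-</div>
  <div class="score">Biology A+</div>
  <div class="score">Chemistry B</div>
</div>
<div class="student">
  <div class="score">Art A</div>
  <div class="score">Biology B-</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

grades = [
    div.get_text(strip=True).split()[-1]          # last token is the grade
    for div in soup.find_all("div", class_="score")
    if div.get_text(strip=True).startswith("Biology")
]
```

This sidesteps the positional div[0]/div[1] problem entirely by matching on the text rather than the index.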

BeautifulSoup - TypeError: 'NoneType' object is not callable

Submitted by 南楼画角 on 2019-12-21 03:43:19
Question: I need to make my code backwards compatible with Python 2.6 and BeautifulSoup 3. My code was written using Python 2.7, and in that case using BS4. But when I try to run it on the Squeeze server (it has Python 2.6 and BS3), I get this error: try: from bs4 import BeautifulSoup except ImportError: from BeautifulSoup import BeautifulSoup gmp = open(fname, 'r') soup = BeautifulSoup(gmp) p = soup.body.div.find_all('p') which raises: TypeError: 'NoneType' object is not callable If I change…
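A sketch of one backwards-compatible spelling: BS3 only defines camelCase `findAll`, and BS4 keeps it as an alias of `find_all`, so camelCase works under both (the example runs under BS4; BS3 behaviour is assumed from its documentation):

```python
try:
    from bs4 import BeautifulSoup            # BS4 on Python 2.7+/3.x
except ImportError:
    from BeautifulSoup import BeautifulSoup  # BS3 on the Python 2.6 box

soup = BeautifulSoup("<body><div><p>one</p><p>two</p></div></body>")
# find_all does not exist in BS3: BS3 treats unknown attribute access as a
# tag lookup that returns None, hence "'NoneType' object is not callable"
# when it is then called. The camelCase form is defined in both versions.
texts = [p.string for p in soup.body.div.findAll("p")]
```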

Get the structure of HTML code

Submitted by 对着背影说爱祢 on 2019-12-21 02:42:22
Question: I'm using BeautifulSoup4 and I'm curious whether there is a function which returns the structure (ordered tags) of the HTML code. Here is an example: <html> <body> <h1>Simple example</h1> <p>This is a simple example of html page</p> </body> </html> print page.structure(): >> <html> <body> <h1></h1> <p></p> </body> </html> I tried to find a solution but had no success. Thanks. Answer 1: There is not, to my knowledge, but a little recursion should work: def taggify(soup): for tag in soup: if isinstance…
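The truncated answer can be completed along these lines (a guessed reconstruction of the recursion, not the answerer's exact code):

```python
from bs4 import BeautifulSoup, Tag

def structure(node, depth=0):
    """Return the tag skeleton of `node`, dropping all text and attributes."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):               # skip NavigableString text nodes
            pad = "  " * depth
            lines.append(pad + "<%s>" % child.name)
            lines.extend(structure(child, depth + 1))
            lines.append(pad + "</%s>" % child.name)
    return lines

page = BeautifulSoup(
    "<html><body><h1>Simple example</h1>"
    "<p>This is a simple example of html page</p></body></html>",
    "html.parser",
)
skeleton = "\n".join(structure(page))
```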

Batch downloading text and images from URL with Python / urllib / beautifulsoup?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-21 02:39:07
Question: I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python. import urllib, urllib2 import urlparse from BeautifulSoup import BeautifulSoup import os, sys def getAllImages(url): query = urllib2.Request(url) user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" query.add_header("User-Agent", user_agent) page = BeautifulSoup(urllib2.urlopen(query)) for div in…
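A modernized sketch of the same idea (Python 3 `urllib.request` instead of `urllib2`; the page and hostnames below are invented, and the actual network call is left commented out):

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def image_urls(page_html, base_url):
    """Collect an absolute URL for every <img src=...> on the page."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]

def download(url, path, user_agent="Mozilla/5.0"):
    """Fetch one resource with a browser-like User-Agent and save it to disk."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())

# Invented sample page; in practice page would come from a first download().
page = '<img src="/pics/a.jpg"><img src="http://cdn.example.com/b.png">'
urls = image_urls(page, "http://example.com/gallery")
# for i, u in enumerate(urls):
#     download(u, "img_%03d" % i)
```

`urljoin` handles both relative and absolute `src` values, which is the part that usually trips up batch downloaders.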

Access next sibling <li> element with BeautifulSoup

Submitted by 本秂侑毒 on 2019-12-20 23:22:09
Question: I am completely new to web parsing with Python/BeautifulSoup. I have an HTML file that has (in part) the following code: <div id="pages"> <ul> <li class="active"><a href="example.com">Example</a></li> <li><a href="example.com">Example</a></li> <li><a href="example1.com">Example 1</a></li> <li><a href="example2.com">Example 2</a></li> </ul> </div> I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li>…
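Sticking with the markup above, `find_next_sibling("li")` walks from the active item to the next `<li>`, skipping the whitespace text nodes that plain `.next_sibling` would return:

```python
from bs4 import BeautifulSoup

html = """
<div id="pages"><ul>
  <li class="active"><a href="example.com">Example</a></li>
  <li><a href="example1.com">Example 1</a></li>
  <li><a href="example2.com">Example 2</a></li>
</ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

item = soup.find("li", class_="active")
remaining = []
while True:
    item = item.find_next_sibling("li")   # skips whitespace between <li> tags
    if item is None:                      # no more <li> siblings
        break
    remaining.append(item.a["href"])
```

The loop mirrors "visit each link until there are no more `<li>` tags"; fetching each href on every iteration would be the next step.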