html-parsing | 易学教程

BeautifulSoup fails to parse long view state

Question: I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0 If I print the resulting soup, it ends like this: kZXI9IjAi"/></form></body></html> Searching for the last characters 9IjaI in the raw HTML, I found that they are in the middle of a huge view state. BeautifulSoup seems to have a problem with this. Any hint as to what I might be doing wrong, or how to parse such a page? Answer 1: BeautifulSoup uses a pluggable HTML parser to build the 'soup';
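One way to check whether a parser copes with a very long view state is to feed it a synthetic page and see whether tags after the hidden field still arrive. The sketch below uses Python's standard-library `html.parser` (the parser BeautifulSoup wraps when asked for `"html.parser"`), not BeautifulSoup itself. The field name `__VIEWSTATE`, the helper `find_viewstate`, and the synthetic page are assumptions made for illustration; they are not taken from the page in the question.

```python
from html.parser import HTMLParser


class ViewStateFinder(HTMLParser):
    """Records every start tag and the value of a hidden __VIEWSTATE input."""

    def __init__(self):
        super().__init__()
        self.viewstate = None
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        self.start_tags.append(tag)
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "__VIEWSTATE":
            self.viewstate = attr_map.get("value")


def find_viewstate(html):
    """Hypothetical helper: parse html and return (viewstate, start_tags)."""
    parser = ViewStateFinder()
    parser.feed(html)
    parser.close()
    return parser.viewstate, parser.start_tags


# Synthetic page with a 200,000-character view state (made up for this test).
page = (
    '<html><body><form>'
    '<input type="hidden" name="__VIEWSTATE" value="' + "A" * 200000 + '"/>'
    '</form><p id="after">still parsed</p></body></html>'
)
viewstate, tags = find_viewstate(page)
print(len(viewstate), tags)
```

If the paragraph after the form shows up in `tags`, the parser did not stop inside the long attribute, which helps narrow the problem down to the choice of parser rather than the page.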

Parsing HTML - How to get a number from a tag?

Question: I am developing a Windows Forms application that interacts with a web site. Using a WebBrowser control, I control the web site and can iterate through the tags using: HtmlDocument webDoc1 = this.webBrowser1.Document; HtmlElementCollection aTags = webDoc1.GetElementsByTagName("a"); Now I want to get a particular piece of text from the tag below: <a href="issue?status=-1,1,2,3,4,5,6,7&@sort=-activity&@search_text=&@dispname=Show Assigned&@filter=status,assignedto&@group=priority&

Regex within html tags

Question: I would like to parse the HD price from the following snippet of HTML. I only have fragments of the HTML code, so I cannot use an HTML parser for this. <div id="left-stack"> <span>View In iTunes</span></a> <span class="price">£19.99</span> <ul class="list"> <li>HD Version</li> Basically, the rule would be: find the price before the words "HD Version" (case insensitive). Here is what I have so far: re.match(r'^(\d|.){1,6}...HD\sVersion', string) How would I extract the value "19.99"
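A minimal sketch of one possible approach, assuming the price always sits in a `class="price"` span somewhere before the words "HD Version". The function name `hd_price` is hypothetical, and the pattern is only one of several that would work on the fragment quoted above. It uses `re.search` rather than `re.match`, because `re.match` only matches at the start of the string.

```python
import re

# Capture the number inside the price span that is followed by "HD Version",
# without skipping over another price span in between.
HD_PRICE = re.compile(
    r'class="price">\D*(\d+(?:\.\d+)?)</span>'  # currency symbol, then digits
    r'(?:(?!class="price").)*?'                 # anything except a new price span
    r'HD\s+Version',
    re.IGNORECASE | re.DOTALL,
)


def hd_price(fragment):
    """Return the price string before 'HD Version', or None if absent."""
    match = HD_PRICE.search(fragment)
    return match.group(1) if match else None


fragment = (
    '<div id="left-stack"> <span>View In iTunes</span></a> '
    '<span class="price">£19.99</span> <ul class="list"> <li>HD Version</li>'
)
print(hd_price(fragment))  # -> 19.99
```

The negative lookahead keeps the match from starting at an earlier price (for example an SD price) and running on to a later "HD Version" label.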

Issue with html tags while scraping data using beautiful soup

Question: Common piece of code: # -*- coding: cp1252 -*- import csv import urllib2 import sys import time from bs4 import BeautifulSoup from itertools import islice page = urllib2.urlopen('http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html').read() soup = BeautifulSoup(page) prices = soup.findAll('div', {"class": "price"}) After this I try the following code to get the data: Code 1: for price in prices: print unicode(price.string).encode('utf8') Output 1: No output, code runs without any

JSOUP HTML Parser

Question: Is there a way to get the start line and column number and the end line and column number of an element/tag? I am creating an HTML editor that needs to highlight a tag, for speed optimization in some scenarios, given its start and end line and column numbers. Answer 1: No, unfortunately this is not possible with jsoup at the current time. At the moment jsoup does not track line numbers or character positions when parsing, so it's not possible to extract them. As this is not a core use case, I don't want to extend the

DomParser parseFromString removing nodes

Question: I came across some strange behaviour when using the DOMParser. It appears that if the first element is a TEMPLATE, it's ignored. See the output of the code below: printTags('<template></template><h1></h1>', 'text/html'); document.write('<hr>') printTags('<h1></h1><template></template>', 'text/html'); function printTags(str) { let doc = new DOMParser().parseFromString(str, 'text/html'); document.write(Array.from(doc.body.children).map(child => child.tagName).join(',')); } Browser: Chrome 72 Is this

Jsoup.parse() vs. Jsoup.parse() - or How does URL detection work in Jsoup?

Question: Jsoup has 2 HTML parse() methods: parse(String html) - "As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag." parse(String html, String baseUri) - "The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag." I am having difficulty understanding the difference between the two: In the 2nd parse() version, what does "resolve relative URLs to absolute URLs, that occur
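The idea of a base URI can be illustrated outside Jsoup. The sketch below uses Python's standard `urllib.parse.urljoin`, not the Jsoup API, to show what "resolving a relative URL to an absolute URL" means in general. The base URI, the sample links, and the helper name `resolve_links` are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(base_uri, hrefs):
    """Resolve each href against base_uri; an empty base leaves hrefs as-is."""
    return [urljoin(base_uri, href) for href in hrefs]


hrefs = ["page2.html", "../img/logo.png", "/about", "http://other.example/x"]

# With a base URI, relative links become absolute.
print(resolve_links("http://example.com/docs/index.html", hrefs))

# Without one, relative links cannot be resolved and stay relative.
print(resolve_links("", hrefs))
```

This is the general distinction the two quoted descriptions draw: without a base URI, the only remaining source for an absolute location is a base declaration inside the HTML itself.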

Extending CSS selectors in BeautifulSoup

Question: BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it can only accept numerical values; arguments like even or odd are not allowed. Is it possible to extend BeautifulSoup's CSS selectors, or to let it use lxml.cssselect internally as the underlying CSS selection mechanism? Let's take a look at an example problem/use case. Locate only the even rows in the following HTML: <table> <tr> <td>1</td> <tr> <td>2</td>

PHP Get contents of webpage

Question: I am using the PHP Simple HTML DOM Parser to get the contents of a webpage. Even after confirming that what I was doing was right, I still got an error saying that nothing was found. So here's what I am using to see whether anything is actually being fetched: <?php include_once('simple_html_dom.php'); error_reporting(E_ALL); ini_set('display_errors', '1'); $first_url = "http://www.transfermarkt.co.uk/en/chinese-super-league/startseite/wettbewerb_CSL.html"; // works $html = file_get_html($first_url

Getting more granular diffs from difflib (or a way to post-process a diff to achieve the same thing)

Question: I downloaded this page and made a minor edit to it, changing the first 65 in this paragraph to 68. I then parse both sources with BeautifulSoup and diff them with difflib. url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM' response = urllib2.urlopen(url) content = response.read() # get response as list of lines url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html' response2 = urllib2.urlopen(url2) content2 = response2.read() # get response as
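One possible way to get a more granular diff, sketched under the assumption that the changed line has already been isolated: compare the two versions word by word with `difflib.SequenceMatcher` instead of line by line. The helper name `word_level_changes` and the two sample sentences are invented for illustration; they are not the text of the page in the question.

```python
import difflib


def word_level_changes(old_line, new_line):
    """Return (tag, old_words, new_words) for each non-equal word span."""
    old_words = old_line.split()
    new_words = new_line.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


# Hypothetical sample lines that differ in a single number.
old = "Full retirement age is 65 for people born before 1938."
new = "Full retirement age is 68 for people born before 1938."
print(word_level_changes(old, new))  # -> [('replace', '65', '68')]
```

Because the sequences are lists of words rather than whole lines, the result points at the single changed token instead of reporting the entire line as different.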