html-parsing | 易学教程

How to extract html table by using Beautifulsoup

Question: Take the HTML snippet below as an example:

    >>> soup
    <table>
    <tr><td class="abc">This is ABC</td> </tr>
    <tr><td class="firstdata"> data1_xxx </td> </tr>
    </table>
    <table>
    <tr><td class="efg">This is EFG</td> </tr>
    <tr><td class="firstdata"> data1_xxx </td> </tr>
    </table>

If I can only find the table I want by the class of one of its table cells,

    >>> soup.findAll("td", {"class": "abc"})
    [<td class="abc">This is ABC</td>]

how can I extract the whole table, as below?

    <table>
    <tr><td class="abc">This is ABC</td> </tr>
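One way to approach the question above, sketched under the assumption that the `bs4` package (Beautiful Soup 4) is installed: find the identifying cell, then walk up to its enclosing table with `find_parent`. The variable names `cell`, `table`, and `cells` are illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="abc">This is ABC</td></tr>
  <tr><td class="firstdata"> data1_xxx </td></tr>
</table>
<table>
  <tr><td class="efg">This is EFG</td></tr>
  <tr><td class="firstdata"> data1_xxx </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the identifying cell, then walk up to its enclosing <table>.
cell = soup.find("td", class_="abc")
table = cell.find_parent("table") if cell is not None else None

# Collect the text of every cell in the matched table only.
cells = [td.get_text(strip=True) for td in table.find_all("td")] if table else []

print(table)
print(cells)
```

Guarding against `cell` being `None` avoids an `AttributeError` when no cell with that class exists on the page.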

DOMParser.parseFromString(text, "text/html") only interprets the first ~21,500 Bytes. Is this a bug?

Question: I have Win 7, 64 Bit, Firefox 32.0.1, with NoScript running. The following code returns only 199 nodes from aXML.getElementsByTagName("node"), whereas there are 300 in the parsed text, which is not well-formed XML.

    var atext = '';
    for (var i = 0; i < 300; i++) {
        atext += ' <node id="' + i + '" lat="42.5168939" lon="1.553855" version="2" changeset="21730124"/>' + "\n\r";
    }
    parser = new DOMParser();
    aXML = parser.parseFromString(atext, "text/html");
    console.log(" nodes: " + aXML.getElementsByTagName("node")

Parse inner HTML

Question: This is what I want to parse:

    <div class="photoBox pB-ms">
    <a href="/user_details?userid=ePDZ9HuMGWR7vs3kLfj3Gg">
    <img width="100" height="100" alt="Photo of Debbie K." src="http://s3-media2.px.yelpcdn.com/photo/xZab5rpdueTCJJuUiBlauA/ms.jpg">
    </a>
    </div>

I am using the following XPath to find it:

    HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms']");

This works and returns all the divs with the photoBox class. But when I want to get the a href using

Dynamically Exclude Content In PHP Simple HTML DOM Parser

Question: I am making a PHP-based application that fetches content from a site using the PHP Simple HTML DOM Parser. I want to dynamically exclude some text between two HTML tags from the content. If the source code of the content is:

    Some description or content ETC ABC <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video...

I want to remove all the text from <span class="s"> up to the first <b> HTML tag, so the output will be:

    Some description or content ETC ABC <span class="s

Java/Android how to get JSON from a html response?

Question: I'm getting an HTML response from HttpGet. The response looks like the following:

    <div class="noti-contents">
    <button class="accept-invitation gui-button" data-invite="pi:103158:18:60:114779" data-invite-details='{"f":"103158","p":18,"api":false,"pid":60,"t":114779,"sub":"p10315857a3f8","u":{"id":"103158","name":"xxxxxx","profile_image":"{1}","status":"1"}}'><span>Accept</span></button>
    </div>

All of the above is stored in a string variable 'response', but in my app I only need the JSON
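The question above targets Java/Android, but the underlying idea (parse the HTML, read the `data-invite-details` attribute, then decode it as JSON) can be sketched in Python with only the standard library. The class name `InviteDetailsParser` and the variable `invite` are hypothetical names chosen for this illustration; the markup is the sample from the question.

```python
import json
from html.parser import HTMLParser


class InviteDetailsParser(HTMLParser):
    """Collects the decoded JSON from every data-invite-details attribute."""

    def __init__(self):
        super().__init__()
        self.details = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        for name, value in attrs:
            if name == "data-invite-details" and value:
                self.details.append(json.loads(value))


response = (
    '<div class="noti-contents">'
    '<button class="accept-invitation gui-button" '
    'data-invite="pi:103158:18:60:114779" '
    'data-invite-details=\'{"f":"103158","p":18,"api":false,"pid":60,'
    '"t":114779,"sub":"p10315857a3f8","u":{"id":"103158","name":"xxxxxx",'
    '"profile_image":"{1}","status":"1"}}\'>'
    '<span>Accept</span></button></div>'
)

parser = InviteDetailsParser()
parser.feed(response)
parser.close()

invite = parser.details[0]
print(invite["u"]["name"])
```

Reading the attribute through an HTML parser, rather than slicing the string by hand, keeps the extraction working if the attribute order or surrounding whitespace changes.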

How do you parse and process HTML/XML in PHP?

Question: How can one parse HTML/XML and extract information from it?

Answer 1:

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the third-party libraries, and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs

BeautifulSoup not working, getting NoneType error

Question: I am using the following code (taken from "retrieve links from web page using python and BeautifulSoup"):

    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        if link.has_attr('href'):
            print link['href']

However, I don't understand why I am getting the following error message:

    Traceback (most recent call last): File "C:\Users
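The traceback above is cut off, so the actual cause cannot be confirmed from this excerpt. One plausible explanation is that the code imports the old `BeautifulSoup` (version 3) package, whose tags do not provide `has_attr`. The sketch below shows the equivalent link extraction with Beautiful Soup 4, assuming the `bs4` package is installed; it parses an inline sample string (an assumption, used instead of the live page) so that it runs without network access.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Illustrative markup standing in for the downloaded page.
html = (
    '<p><a href="http://example.com/a">A</a> '
    '<a name="anchor">no href</a> '
    '<a href="/b">B</a></p>'
)

# parse_only is the bs4 spelling of the old parseOnlyThese argument.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# href=True keeps only the <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(links)
```

Filtering with `href=True` in `find_all` also sidesteps non-tag nodes that iterating over the soup directly may yield.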

Python:Getting text from html using Beautifulsoup

Question: I am trying to extract the ranking text number from this link (example: Kaggle user ranking no. 1). I am using the following code:

    def get_single_item_data(item_url):
        sourceCode = requests.get(item_url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText)
        for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
            print(item_name.string)

    item_url = 'https://www.kaggle.com/titericz'
    get_single_item_data(item_url)

The result is None. The problem is

Open source html parsing class not properly parsing spaces between paragraphs

Question: I'm using an open source method that parses HTML text into an NSString. The resulting strings have large amounts of white space between the first couple of paragraphs, but only one line of space between subsequent paragraphs. Here is an example of the output. Below is the method I'm calling. I've changed only two lines of the code: for stopCharacters and newLineAndWhitespaceCharacters, I removed /n from the character set because, when it was included, the entire text came out as one long paragraph. -

Beautiful Soup: Extracting href from HTML ordered list

Question: I am attempting to extract the URLs from an HTML ordered list using the BeautifulSoup Python module. My code returns a list of None values equal in number to the items in the ordered list, so I know I'm in the right place in the document. What am I doing wrong? The URL I am scraping from is http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV

Here are 5 of 50 lines from the HTML list (apologies for the length):

    <div id="body" class="article-body"> <ol> <li><a href=
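The asker's code is not visible in this excerpt, so the following is only a sketch of one way to read the `href` values, assuming the `bs4` package is installed. The list items and URLs are invented placeholders that mirror the `div#body > ol > li > a` structure shown above; they are not the real page content.

```python
from bs4 import BeautifulSoup

# Placeholder markup with the same nesting as the excerpt above.
html = """
<div id="body" class="article-body">
  <ol>
    <li><a href="http://example.com/1">First story</a>: some text</li>
    <li><a href="http://example.com/2">Second story</a>: some text</li>
    <li>No link in this item</li>
  </ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the <a> tags inside the list and read the attribute from the
# <a> tag itself, not from the surrounding <li>.
urls = [a["href"] for a in soup.select("div#body ol li a[href]")]

print(urls)
```

Reading the attribute from the `<a>` element matters because asking the `<li>` for `href` (for example with `li.get("href")`) returns `None` for every item.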