html-parsing

How to extract an HTML table using BeautifulSoup

不打扰是莪最后的温柔 submitted on 2019-12-13 01:29:54
Question: Taking the HTML snippet below as an example: >>>soup <table> <tr><td class="abc">This is ABC</td> </tr> <tr><td class="firstdata"> data1_xxx </td> </tr> </table> <table> <tr><td class="efg">This is EFG</td> </tr> <tr><td class="firstdata"> data1_xxx </td> </tr> </table> If I can only find the table I want by the class of one of its table cells, >>>soup.findAll("td",{"class":"abc"}) [<td class="abc">This is ABC</td>] how can I extract the whole table, as below? <table> <tr><td class="abc">This is ABC</td> </tr>
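One way to approach this, sketched here under the assumption that the bs4 package (BeautifulSoup 4) is installed: locate the cell by its class, then walk up to the enclosing table with find_parent. The helper name table_for_cell_class is hypothetical and introduced only for illustration; the markup is the snippet from the question.

```python
from bs4 import BeautifulSoup

# The two tables from the question.
html = """
<table>
<tr><td class="abc">This is ABC</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
<table>
<tr><td class="efg">This is EFG</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
"""


def table_for_cell_class(markup, css_class):
    """Return the <table> that contains the first <td> with the given class.

    Hypothetical helper: returns None when no such cell exists.
    """
    soup = BeautifulSoup(markup, "html.parser")
    cell = soup.find("td", class_=css_class)
    if cell is None:
        return None
    # find_parent climbs the tree until it reaches the nearest <table> ancestor.
    return cell.find_parent("table")


table = table_for_cell_class(html, "abc")
print(table)
```

Because find_parent returns the nearest matching ancestor, this also behaves sensibly with nested tables: it yields the innermost table around the cell.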

DOMParser.parseFromString(text, "text/html") only interprets the first ~21,500 bytes. Is this a bug?

天涯浪子 submitted on 2019-12-12 22:09:30
Question: I am running Windows 7 (64-bit) and Firefox 32.0.1 with NoScript. The following code returns only 199 nodes from aXML.getElementsByTagName("node"), whereas the parsed text, which is not well-formed XML, contains 300. var atext = ''; for (var i=0;i<300;i++) { atext += ' <node id="'+i+'" lat="42.5168939" lon="1.553855" version="2" changeset="21730124"/>'+"\n\r"; } parser = new DOMParser(); aXML= parser.parseFromString(atext , "text/html"); console.log(" nodes: " + aXML.getElementsByTagName("node")

Parse inner HTML

大城市里の小女人 submitted on 2019-12-12 21:46:27
Question: This is what I want to parse: <div class="photoBox pB-ms"> <a href="/user_details?userid=ePDZ9HuMGWR7vs3kLfj3Gg"> <img width="100" height="100" alt="Photo of Debbie K." src="http://s3-media2.px.yelpcdn.com/photo/xZab5rpdueTCJJuUiBlauA/ms.jpg"> </a> </div> I am using the following XPath to find it: HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//div[@class='photoBox pB-ms']"); This works and returns all the divs with the photoBox class. But when I want to get the href using

Dynamically Exclude Content In PHP Simple HTML DOM Parser

大兔子大兔子 submitted on 2019-12-12 18:52:57
Question: I am making a PHP-based application that fetches content from a site using the PHP Simple HTML DOM Parser. I want to dynamically exclude some text between two HTML tags from the content. If the source code of the content is: Some description or content ETC ABC <span class="s"> May 3 2009 <b> ABC Some Text </b> Some photo or video... I want to remove all the text from <span class="s"> up to the first <b> HTML tag, so the output will be: Some description or content ETC ABC <span class="s

Java/Android: how to get JSON from an HTML response?

喜你入骨 submitted on 2019-12-12 17:31:25
Question: I'm getting an HTML response from HttpGet; the response looks like the following: <div class="noti-contents"> <button class="accept-invitation gui-button" data-invite="pi:103158:18:60:114779" data-invite-details='{"f":"103158","p":18,"api":false,"pid":60,"t":114779,"sub":"p10315857a3f8","u":{"id":"103158","name":"xxxxxx","profile_image":"{1}","status":"1"}}'><span>Accept</span></button> </div> All of the above is stored in a string variable 'response', but in my app I only need the JSON
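The general technique is language-independent: parse the HTML, read the data-invite-details attribute, and hand its value to a JSON parser. The sketch below shows the idea in Python with only the standard library rather than in Java, so it illustrates the approach and is not a drop-in Android solution. The class name InviteDetailsParser is hypothetical; the markup is the response from the question.

```python
import json
from html.parser import HTMLParser


class InviteDetailsParser(HTMLParser):
    """Collects the decoded JSON from every data-invite-details attribute."""

    def __init__(self):
        super().__init__()
        self.details = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name == "data-invite-details" and value:
                self.details.append(json.loads(value))


# The response string from the question.
response = (
    '<div class="noti-contents">'
    '<button class="accept-invitation gui-button" '
    'data-invite="pi:103158:18:60:114779" '
    'data-invite-details=\'{"f":"103158","p":18,"api":false,"pid":60,'
    '"t":114779,"sub":"p10315857a3f8","u":{"id":"103158","name":"xxxxxx",'
    '"profile_image":"{1}","status":"1"}}\'>'
    '<span>Accept</span></button></div>'
)

parser = InviteDetailsParser()
parser.feed(response)
details = parser.details[0]
print(details["u"]["name"])
```

Reading the attribute through an HTML parser, instead of slicing the string by index or regex, keeps the extraction working if attribute order or whitespace changes.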

How do you parse and process HTML/XML in PHP?

杀马特。学长 韩版系。学妹 submitted on 2019-12-12 17:31:17
Question: How can one parse HTML/XML and extract information from it? Answer 1: Native XML Extensions I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the third-party libraries, and give me all the control I need over the markup. DOM The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs

BeautifulSoup not working, getting NoneType error

家住魔仙堡 submitted on 2019-12-12 16:41:05
Question: I am using the following code (taken from "retrieve links from web page using python and BeautifulSoup"): import httplib2 from BeautifulSoup import BeautifulSoup, SoupStrainer http = httplib2.Http() status, response = http.request('http://www.nytimes.com') for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')): if link.has_attr('href'): print link['href'] However, I don't understand why I am getting the following error message: Traceback (most recent call last): File "C:\Users
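The traceback is cut off, so the cause cannot be confirmed from the excerpt. One plausible explanation is a version mismatch: the code imports the old BeautifulSoup 3 package, while has_attr is a BeautifulSoup 4 (bs4) method. The sketch below shows the same link extraction written for bs4, assuming that package is installed. The sample markup and the helper name extract_links are hypothetical, and an inline string stands in for the HTTP response so the example runs without network access.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical stand-in for the downloaded page.
sample_html = """
<html><body>
<a href="http://example.com/one">One</a>
<a name="anchor-only">No href</a>
<p><a href="/two">Two</a></p>
</body></html>
"""


def extract_links(markup):
    """Return the href of every <a> tag that has one."""
    # parse_only is the bs4 spelling of the old parseOnlyThese argument;
    # it restricts parsing to <a> tags.
    only_anchors = SoupStrainer("a")
    soup = BeautifulSoup(markup, "html.parser", parse_only=only_anchors)
    # href=True filters out anchors without an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(sample_html)
for href in links:
    print(href)
```

Using find_all with href=True avoids calling has_attr on non-tag nodes, such as a doctype, that can appear when iterating over the soup directly.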

Python: Getting text from HTML using BeautifulSoup

白昼怎懂夜的黑 submitted on 2019-12-12 12:24:27
Question: I am trying to extract the ranking text from this link (example: the Kaggle user ranked no. 1). I am using the following code: def get_single_item_data(item_url): sourceCode = requests.get(item_url) plainText = sourceCode.text soup = BeautifulSoup(plainText) for item_name in soup.findAll('h4',{'data-bind':"text: rankingText"}): print(item_name.string) item_url = 'https://www.kaggle.com/titericz' get_single_item_data(item_url) The result is None . The problem is

Open source HTML parsing class not properly parsing spaces between paragraphs

假装没事ソ submitted on 2019-12-12 11:02:53
Question: I'm using an open source method that parses HTML text into an NSString. The resulting strings have large amounts of white space between the first couple of paragraphs, but only one line of space between subsequent paragraphs. Here is an example of the output. Below is the method I'm calling. I've only changed two lines of the code. For stopCharacters and newLineAndWhitespaceCharacters , I removed /n from the character set because when it was included, the entire text was one long paragraph. -

Beautiful Soup: Extracting href from HTML ordered list

守給你的承諾、 submitted on 2019-12-12 10:06:43
Question: I am attempting to extract the URLs from within an HTML ordered list using the BeautifulSoup Python module. My code returns a list of None values equal in number to the items in the ordered list, so I know I'm in the right place in the document. What am I doing wrong? The URL I am scraping from is http://www.dailykos.com/story/2013/04/27/1203495/-GunFAIL-XV Here are 5 of 50 lines from the HTML list (apologies for the length): > `<div id="body" class="article-body"> <ol> <li><a href=
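The asker's code is not shown in the excerpt, so the actual mistake is unknown. A common source of None values in this situation is reading the href from the <li> element instead of from the <a> inside it. The sketch below, assuming bs4 is installed, selects the anchors directly. The list items and URLs are hypothetical placeholders (only the wrapper div and ol mirror the question), and hrefs_in_ordered_lists is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical list items; only the outer <div>/<ol> structure follows the question.
sample_html = """
<div id="body" class="article-body">
<ol>
<li><a href="http://example.com/story-1">First story</a></li>
<li><a href="http://example.com/story-2">Second story</a></li>
<li>No link here</li>
</ol>
</div>
"""


def hrefs_in_ordered_lists(markup):
    """Return the href of every link nested inside an <ol> list item."""
    soup = BeautifulSoup(markup, "html.parser")
    # The CSS selector targets <a> tags that carry an href, not the <li> itself.
    return [a["href"] for a in soup.select("ol li a[href]")]


urls = hrefs_in_ordered_lists(sample_html)
for url in urls:
    print(url)
```

List items without a link are skipped by the selector, so the result contains only real URLs and no None placeholders.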