html-parsing

PHP: parse an HTML table and count specific <td> cells similar to another

拥有回忆 submitted on 2019-12-25 08:01:24
Question: This question follows another one, just solved here. Now I want to do a different count that is harder to figure out. In my parsed HTML table, every row contains two very similar, consecutive 'td' cells (numbers 4 and 5):
    <tr>
    (1) <td class="tdClass" ....</td>
    (2) <td class="tdClass" ....</td>
    (3) <td class="tdClass" ....</td>
    (4) <td class="tdClass" align="center" nowrap="">No</td>
    (5) <td class="tdClass" align="center" nowrap="">No</td>
    </tr>
The strings could be "No" in the first 'td' and

Removing characters from a variable created using preg_replace

有些话、适合烂在心里 submitted on 2019-12-25 07:22:07
Question: I'm trying to trim a few characters off the end of a URL that I get from a preg_replace call, but it doesn't seem to work. I'm not familiar with using these variables in preg_replace (it was just something I found that "mostly" worked). Here's my attempt:
    function addlink_replace($string) {
    $pattern = '/<ul(.*?)class="slides"(.*?)<img(.*?)src="(.*?)"(.*?)>(.*?)<\/ul>/is';
    $URL = substr($4, 0, -8);;
    $replacement = '<ul$1class="slides"$2<a rel=\'shadowbox\' href="'.$URL.'">
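The attempt above cannot work as written: `$4` only has meaning inside the replacement string, so `substr($4, 0, -8)` runs before any match exists. In PHP the usual fix is `preg_replace_callback`, which hands each match to a function. The sketch below shows the same idea in Python, where `re.sub` accepts a function as the replacement. The original replacement string is cut off in the excerpt, so the markup built here (wrapping the image in the link) is an assumption, as are the sample URL and the `trim` parameter.

```python
import re

# Same pattern as in the question; IGNORECASE | DOTALL mirror PHP's /is flags.
PATTERN = re.compile(
    r'<ul(.*?)class="slides"(.*?)<img(.*?)src="(.*?)"(.*?)>(.*?)</ul>',
    re.IGNORECASE | re.DOTALL,
)


def addlink_replace(html: str, trim: int = 8) -> str:
    """Wrap the first slide image in a link whose URL has `trim` chars cut off."""

    def build(match: re.Match) -> str:
        g1, g2, g3, src, g5, g6 = match.groups()
        # The captured URL is available here, so it can be post-processed.
        url = src[:-trim]
        # Hypothetical output markup: the question's replacement is truncated.
        return (
            f'<ul{g1}class="slides"{g2}'
            f'<a rel="shadowbox" href="{url}"><img{g3}src="{src}"{g5}></a>'
            f'{g6}</ul>'
        )

    return PATTERN.sub(build, html)


if __name__ == "__main__":
    sample = '<ul class="slides"><li><img src="http://example.com/pic_150.jpg" alt=""></li></ul>'
    print(addlink_replace(sample))
```

The callback receives the match object, so any string operation can be applied to a captured group before the replacement text is assembled.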

How can I parse an HTML file in Windows Phone 7?

点点圈 submitted on 2019-12-25 07:22:00
Question: Hi, I am using the XML file given below, and I want to parse the HTML inside it.
    <Description> <Fullcontent>
    <div id="container" class="cf">
    <link rel="stylesheet" href="http://dev2.mercuryminds.com/imageslider/css/demo.css" type="text/css" media="screen" />
    <ul class="slides">
    <li>Sonam Kapoor<img src="http://deys.jpeg"/></li>
    <li>Amithab<img src="http://deysAmithab.jpeg"/></li>
    <li>sridevi<img src="http://deyssridevi.jpeg"/></li>
    <li>anil-kapoor<img src="http://deysanil-kapoor.jpeg"/></li>
    </ul>
    </div> <
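The excerpt ends before any answer, so the following is only a language-neutral illustration of the extraction the question seems to ask for: walking the list items and collecting each caption with its image URL. It is written in Python with the standard `html.parser` module rather than in C# for Windows Phone; the class name `SlideParser` and the helper `parse_slides` are hypothetical.

```python
from html.parser import HTMLParser


class SlideParser(HTMLParser):
    """Collect the caption text and image URL of every <li> element."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.slides = []  # one {"caption": ..., "image": ...} dict per <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.slides.append({"caption": "", "image": None})
        elif tag == "img" and self.in_li:
            # attrs is a list of (name, value) pairs
            self.slides[-1]["image"] = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.slides[-1]["caption"] += data.strip()


def parse_slides(html: str) -> list:
    parser = SlideParser()
    parser.feed(html)
    return parser.slides


if __name__ == "__main__":
    sample = """
    <ul class="slides">
    <li>Sonam Kapoor<img src="http://deys.jpeg"/></li>
    <li>Amithab<img src="http://deysAmithab.jpeg"/></li>
    </ul>
    """
    for slide in parse_slides(sample):
        print(slide["caption"], slide["image"])
```

An event-based parser like this tolerates markup that is not well-formed XML, which matters when an HTML fragment is embedded in an XML payload.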

Parse HTML and get a multidimensional array grouped by date using regex (scraping data)?

蹲街弑〆低调 submitted on 2019-12-25 07:20:20
Question: I'm trying to group the results I get by date. Please refer to my previous question, "How to ignore http link in string and return everything else?" Right now I get the schedule list, but it doesn't include any dates, so it's hard to tell which event goes live on which date and time. This confuses people, because without dates the list shows the same time for multiple events that actually go live on different dates. From the previous question, I got a

Is the conversion from HTML to DOM and back to HTML standardized?

左心房为你撑大大i submitted on 2019-12-25 05:23:26
Question: I'm working on a rich-text editor that will use ContentEditable. It's imperative that a document loaded into the browser (from the web server) is not altered in any way merely by the conversion to DOM and back to HTML (assuming the user has not made any changes). It's all right if the HTML document is modified the first time it's created and saved by a browser, but this should not happen again on later saves, which simply requires that all browsers will produce the same DOM based on

Extending a basic web crawler to filter status codes and HTML

落花浮王杯 submitted on 2019-12-25 05:23:11
Question: I followed a tutorial on writing a basic web crawler in Java and have something with basic functionality. At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so that it can pick out specifics such as the HTML page title and the HTTP status code. I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me, but could I do it without using an external library? Here's what I have so far:
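The asker's Java code is cut off in the excerpt, so the following is not a fix for it. It is a small Python sketch, using only the standard library, of the two things being asked for: reading the HTTP status code from the response and pulling the page title out of the HTML. In Java the status code is similarly available without external libraries through `HttpURLConnection.getResponseCode()`. The names `TitleParser`, `extract_title`, and `fetch` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Accumulate the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch(url: str):
    """Return (status_code, page_title) for a URL."""
    try:
        with urlopen(url, timeout=10) as resp:
            status = resp.status
            charset = resp.headers.get_content_charset() or "utf-8"
            body = resp.read().decode(charset, errors="replace")
    except HTTPError as err:
        # 4xx/5xx responses raise, but still carry a status code and a body.
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    return status, extract_title(body)


if __name__ == "__main__":
    print(extract_title("<html><head><title> Example </title></head></html>"))
```

Keeping `extract_title` separate from `fetch` means the parsing step can be checked on a plain string without any network access.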

Is the conversion from HTML to DOM and back to HTML standardized?

一个人想着一个人 submitted on 2019-12-25 05:23:02
Question: I'm working on a rich-text editor that will use ContentEditable. It's imperative that a document loaded into the browser (from the web server) is not altered in any way merely by the conversion to DOM and back to HTML (assuming the user has not made any changes). It's all right if the HTML document is modified the first time it's created and saved by a browser, but this should not happen again on later saves, which simply requires that all browsers will produce the same DOM based on

cURL: submitting POST fields after page load (curl_exec)?

匆匆过客 submitted on 2019-12-25 05:03:08
Question: I have to create a bot to collect some data from my college website. It uses a simple login with regno and captcha fields. They don't use a real captcha; it's a fake one (it can be seen in the page source). So my idea is to use a DOM parser and fetch it from the page source. I am using PHP cURL to do this job. My code:
    <?
    $ch = curl_init();
    $captcha = i will get the value from DOM Parser ( But here is the problem , i have to get it before even executing the page !! )
    $fields = "regno=11BTA00&captcha=$captcha";

Python HTML parsing

独自空忆成欢 submitted on 2019-12-25 04:50:26
Question: I am currently trying to make a program that, given a word, will look up its definition and return it. Although I have gotten this to work, I had to resort to using regex to search for the text between the tags where the definitions are stored. What is a more efficient way to do this using Python 3.x?
Answer 1: lxml works with Python 3. It has an ElementTree-compatible API but uses C libraries behind the scenes, so it's fast, and it supports XPath, which is (sometimes) a nice way of parsing.
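A minimal sketch of the ElementTree-style approach the answer describes. It uses the standard library's `xml.etree.ElementTree` so that it runs without installing anything; the sample markup, the class names `entry` and `def`, and the `definitions` helper are all hypothetical. ElementTree supports only a subset of XPath and needs well-formed markup, so with lxml installed, real-world HTML would more likely go through `lxml.html.fromstring` and the element's `.xpath()` method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a dictionary page.
SAMPLE = """
<html><body>
  <div class="entry">
    <span class="word">parse</span>
    <div class="def">To analyse a string into its syntactic parts.</div>
    <div class="def">To resolve a sentence into its components.</div>
  </div>
</body></html>
"""


def definitions(markup: str) -> list:
    """Return the text of every <div class="def"> in the document."""
    root = ET.fromstring(markup.strip())
    # ".//div[@class='def']" is within ElementTree's limited XPath subset.
    return [el.text.strip() for el in root.findall(".//div[@class='def']")]


if __name__ == "__main__":
    for text in definitions(SAMPLE):
        print(text)
```

Selecting elements by path and attribute replaces the regex search between tags, and the query stays readable when the page structure changes.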

Python BeautifulSoup: parsing multiple tables with same class name

雨燕双飞 submitted on 2019-12-25 04:24:37
Question: I am trying to parse some tables from a wiki page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014. There are four tables with the same class name, "wikitable". When I write:
    movieList= soup.find('table',{'class':'wikitable'})
    rows = movieList.findAll('tr')
it works fine, but when I write:
    movieList= soup.findAll('table',{'class':'wikitable'})
    rows = movieList.findAll('tr')
it throws an error:
    Traceback (most recent call last): File "C:\Python27\movieList.py", line 24, in <module>
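The traceback is cut off in the excerpt, but the second snippet calls `findAll` on the result of another `findAll`, which is a list-like collection of tables rather than a single table, so the row lookup has to happen once per table inside a loop (with BeautifulSoup, roughly `for table in soup.findAll('table', {'class': 'wikitable'}): rows = table.findAll('tr')`). The sketch below shows the same table-by-table structure with the standard library's `html.parser`, so that it runs without BeautifulSoup installed. The class `WikiTableParser`, the helper `parse_wikitables`, and the sample data are hypothetical, and nested tables are assumed not to occur.

```python
from html.parser import HTMLParser


class WikiTableParser(HTMLParser):
    """Collect cell text row by row, grouped per <table class="wikitable">."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per matching table
        self._in_table = False  # assumption: matching tables are not nested
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self._in_table = "wikitable" in classes
            if self._in_table:
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_wikitables(html: str) -> list:
    parser = WikiTableParser()
    parser.feed(html)
    return parser.tables


if __name__ == "__main__":
    sample = (
        '<table class="wikitable"><tr><th>Title</th></tr><tr><td>Film A</td></tr></table>'
        '<table class="wikitable"><tr><td>Film B</td><td>Film C</td></tr></table>'
    )
    # Each table is handled on its own, mirroring a loop over findAll results.
    for index, rows in enumerate(parse_wikitables(sample)):
        print(index, rows)
```

The result is a list with one entry per matching table, so the rows of each table can be processed separately instead of being requested from the whole collection at once.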