html-parsing | 易学教程

Php parsed html table and count specific <td> similar to another

Question: This question follows on from another one that was just solved. Now I want to do a different count, one that is harder to figure out. In my parsed HTML table, every row contains two very similar, consecutive 'td' elements (numbers 4 and 5):

    <tr>
    (1) <td class="tdClass" ....</td>
    (2) <td class="tdClass" ....</td>
    (3) <td class="tdClass" ....</td>
    (4) <td class="tdClass" align="center" nowrap="">No</td>
    (5) <td class="tdClass" align="center" nowrap="">No</td>
    </tr>

The strings could be "No" in the first 'td' and

Removing characters from a variable created using preg_replace

Question: I'm trying to trim a few characters off the end of a URL that I get from a preg_replace call. However, it doesn't seem to be working. I'm not familiar with using these variables in preg_replace (it was just something I found that "mostly" worked). Here's my attempt: function addlink_replace($string) { $pattern = '/<ul(.*?)class="slides"(.*?)<img(.*?)src="(.*?)"(.*?)>(.*?)<\/ul>/is'; $URL = substr($4, 0, -8);; $replacement = '<ul$1class="slides"$2<a rel=\'shadowbox\' href="'.$URL.'">

How can i parse html file in windows phone 7?

Question: Hi, I am using the XML file given below, and I want to parse the HTML inside it. <Description> <Fullcontent> <div id="container" class="cf"> <link rel="stylesheet" href="http://dev2.mercuryminds.com/imageslider/css/demo.css" type="text/css" media="screen" /> <ul class="slides"> <li>Sonam Kapoor<img src="http://deys.jpeg"/></li> <li>Amithab<img src="http://deysAmithab.jpeg"/></li> <li>sridevi<img src="http://deyssridevi.jpeg"/></li> <li>anil-kapoor<img src="http://deysanil-kapoor.jpeg"/></li> </ul> </div> <

Parse HTML and get multidimensional array with date wise using regex (scraping data)?

Question: I'm trying to group the results I get by date. Please refer to my previous question: How to ignore http link in string and return everything else? Right now I get the schedule list, but it doesn't include any dates, so it's hard to tell which event goes live on which date and time. This confuses people, because without dates the list shows the same time for multiple events that actually go live on different dates. From the previous question, I got a

Is the conversion from HTML to DOM and back to HTML standardized?

Question: I'm working on a rich-text editor that will be using ContentEditable. It's imperative that a document loaded into the browser (from the web server) is not altered in any way merely by the conversion to DOM and back to HTML (assuming the user has not made any changes). It's alright if the HTML document is modified the first time it's created and saved by a browser, but this should not happen again afterwards, which simply requires that all browsers will produce the same DOM based on

Extending a basic web crawler to filter status codes and HTML

Question: I followed a tutorial on writing a basic web crawler in Java and have got something with basic functionality. At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so that it can pick out specifics like the HTML page title and the HTTP status code. I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me, but could I do it without using an external library? Here's what I have so far:

cURL Submitting POST fields after page load ( curl_exec )?

Question: I have to create a bot to collect some data from my college website. It uses a simple login with regno and captcha fields. They don't use a real captcha; it's a fake one (it can be seen in the page source). So my idea is to use a DOM parser and fetch it from the page. I am using PHP cURL to do this job. My code: <? $ch = curl_init(); $captch = i will get the value from DOM Parser ( But here is the problem , i have to get it before even executing the page !! ) $fields = "regno=11BTA00&captcha=$captcha";

Python HTML parsing

Question: I am currently trying to make a program that, given a word, will look up its definition and return it. Although I have gotten this to work, I had to resort to using a regex to search for the text between the tags where the definitions are stored. What is a more efficient way to do this using Python 3.x?

Answer 1: lxml works for Python 3. It has an ElementTree-compatible API but uses C libraries behind the scenes, so it's fast, and it supports XPath, which is a nice way of parsing (sometimes).
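
As a rough illustration of the XPath approach mentioned in the answer, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml, so it only handles well-formed markup and a limited subset of XPath; because lxml offers an ElementTree-compatible API, the same idea should carry over. The PAGE snippet, the "definition" class name, and the definitions() helper are all made up for the example and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a dictionary page (not from the question).
PAGE = """
<html>
  <body>
    <div class="entry">
      <h1>parse</h1>
      <span class="definition">to analyse a string into its parts</span>
      <span class="definition">to examine closely</span>
    </div>
  </body>
</html>
"""


def definitions(markup):
    """Return the text of every <span class="definition"> in the markup."""
    root = ET.fromstring(markup.strip())
    # ElementTree understands a small XPath subset, including
    # attribute-equality predicates such as [@class='definition'].
    spans = root.findall(".//span[@class='definition']")
    return [(span.text or "").strip() for span in spans]


if __name__ == "__main__":
    for text in definitions(PAGE):
        print(text)
```

Real-world pages are rarely well-formed XML, which is one reason a lenient HTML parser such as lxml is usually preferred over the strict stdlib XML parser for this job.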

Python BeautifulSoup: parsing multiple tables with same class name

Question: I am trying to parse some tables from a wiki page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014. There are four tables with the same class name, "wikitable". When I write:

    movieList= soup.find('table',{'class':'wikitable'})
    rows = movieList.findAll('tr')

it works fine, but when I write:

    movieList= soup.findAll('table',{'class':'wikitable'})
    rows = movieList.findAll('tr')

it throws an error:

    Traceback (most recent call last):
    File "C:\Python27\movieList.py", line 24, in <module>
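
In BeautifulSoup, findAll returns a list-like collection of matches rather than a single element, so the second snippet calls findAll on a collection of tables instead of on one table. The rows therefore have to be gathered table by table, for instance by looping over the result and calling findAll('tr') on each table. The sketch below illustrates that per-table idea using only the standard library's html.parser rather than BeautifulSoup. The WikitableRows class and the sample markup are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class WikitableRows(HTMLParser):
    """Collect the rows of every <table class="wikitable">, one list per table."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per matching table
        self._in_table = False  # True while inside a matching table
        self._row = None        # cells of the row being read, if any
        self._cell = None       # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table" and "wikitable" in classes:
            self._in_table = True
            self.tables.append([])
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


# Hypothetical sample: two matching tables and one that should be skipped.
SAMPLE = (
    '<table class="wikitable"><tr><th>Title</th></tr><tr><td>A</td></tr></table>'
    '<table class="wikitable"><tr><td>B</td></tr></table>'
    '<table class="other"><tr><td>C</td></tr></table>'
)

parser = WikitableRows()
parser.feed(SAMPLE)
for index, rows in enumerate(parser.tables):
    print(index, rows)
```

Keeping one list of rows per table preserves which table each row came from, which matters on a page where several tables share the same class.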