html-parsing | 易学教程

Parse Website for URLs

Question: Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr I have the following code: <?PHP $url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr"; $input = @file_get_contents($url) or die("Could not access file: $url"); $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>"; if(preg_match_all("/$regexp/siU", $input, $matches)) { // $matches[2] =
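
For the link-extraction question above, a minimal sketch of the parser-based alternative to the regular expression. It is written in Python rather than the asker's PHP, and it uses only the standard-library `html.parser` module. The names `LinkCollector`, `extract_links`, and the sample markup are hypothetical and exist only for illustration; fetching the page is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in anchor tags, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample markup, not taken from the site in the question.
sample = (
    '<p><a href="http://example.com/a">A</a> '
    "<a class='x' href='/b?pg=2'>B</a> "
    '<a name="top">no href</a></p>'
)
print(extract_links(sample))
```

The parser handles quoting and attribute order itself, which is the part the regular expression in the question has to approximate by hand.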

A JavaScript parser for DOM

Question: We have a special requirement in a project where we have to parse a string of HTML (from an AJAX response) client side, via JavaScript only. That's right, no parsing in PHP or Java! I've been going through Stack Overflow this entire week and have not yet found an acceptable solution. Some more details on the requirements: We can use any library (preferably dojo and/or jQuery) or go native! We need to parse an entire HTML document that we receive as a string, including the <head> and <body>.

Using XPath Contains against HTML in Java

Question: I'm scraping values from HTML pages using XPath inside a Java program to get to a specific tag, and occasionally using regular expressions to clean up the data I receive. After some research, I landed on HTML Cleaner ( http://htmlcleaner.sourceforge.net/ ) as the most reliable way to parse raw HTML into a good XML format. HTML Cleaner, however, only supports XPath 1.0 and I find myself needing functions like 'contains'. For instance, in this piece of XML: <div> <td id='1234 foo 5678'>Hello<

Wordwrap / Cut Text in HTML string

Question: Here is what I want to do: I have a string containing HTML tags and I want to cut it using the wordwrap function, excluding the HTML tags. I'm stuck: public function textWrap($string, $width) { $dom = new DOMDocument(); $dom->loadHTML($string); foreach ($dom->getElementsByTagName('*') as $elem) { foreach ($elem->childNodes as $node) { if ($node->nodeType === XML_TEXT_NODE) { $text = trim($node->nodeValue); $length = mb_strlen($text); $width -= $length; if($width <= 0) { // Here, I would like to

Convert html to plain text in VBA

Question: I have an Excel sheet with cells containing HTML. How can I batch convert them to plain text? At the moment there are so many useless tags and styles. I want to write it from scratch, but it will be far easier if I can get the plain text out. I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA then maybe you can suggest how I might pass the cell data to a website and retrieve the data back. Answer 1: Set a reference to "Microsoft HTML Object Library".

How to extract a JSON object that was defined in a HTML page javascript block using Python?

Question: I am downloading HTML pages that have data defined in them in the following way: ... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ... I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.) Thanks. Edit: Would it be possible and more correct to do this with a Python
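
One possible approach to the question above, sketched with the Python standard library only: locate the assignment with a small regular expression, then let `json.JSONDecoder.raw_decode` consume exactly one JSON value starting at that position. The helper name `extract_assigned_json` is hypothetical, and the sketch assumes the right-hand side is strict JSON (double-quoted keys, no trailing commas), as it is in the snippet from the question.

```python
import json
import re


def extract_assigned_json(html, variable="window.blog.data"):
    """Return the JSON value assigned to `variable` in `html`, or None."""
    # Match "<variable> =" plus any whitespace, so the decoder starts
    # exactly at the first character of the JSON value.
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON document starting at the given index and
    # ignores whatever follows it (here the ";" and the closing tag).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# The script block quoted in the question.
page = (
    '... <script type= "text/javascript"> '
    'window.blog.data = {"activity":{"type":"read"}}; </script> ...'
)
print(extract_assigned_json(page))
```

Because the decoder stops at the end of the first complete value, there is no need to guess where the object ends with a regular expression. If the right-hand side is a JavaScript literal that is not valid JSON, `raw_decode` raises `json.JSONDecodeError` and a different strategy is needed.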

How do HTML parsers work if they're not using regexp?

Question: I see questions every day asking how to parse or extract something from some HTML string, and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing for me; I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse? One particular argument for using a regular

Regex select all text between tags

Question: What is the best way to select all the text between 2 tags, e.g. the text between all the 'pre' tags on the page? Answer 1: You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language), but this assumes the simplistic notion that you have very simple and valid HTML. As other commenters have suggested, if you're doing something complex, use an HTML parser. Answer 2: Tag can be completed in another line. This is
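
The pattern from Answer 1 can be tried out as follows. This is a minimal Python sketch, with two assumptions that go beyond the answer as quoted: the `re.DOTALL` flag lets `.` match newlines so that content spanning several lines is captured, and `\b[^>]*` tolerates attributes on the opening tag. The names `PRE_BLOCK` and `text_between_pre` are hypothetical. As the answer warns, this only suits simple, well-formed markup.

```python
import re

# Non-greedy capture between an opening and a closing <pre> tag.
# DOTALL: "." also matches newlines; IGNORECASE: accept <PRE> as well.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)


def text_between_pre(html):
    """Return the raw content of every <pre>...</pre> block, in order."""
    return PRE_BLOCK.findall(html)


# Hypothetical sample input.
sample = "<pre>one</pre><p>skip</p><pre class='x'>two\nlines</pre>"
print(text_between_pre(sample))
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first opening tag to the last closing tag and return one oversized match.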

Android HTML ImageGetter as AsyncTask

Question: Okay, I'm losing my mind over this one. I have a method in my program which parses HTML. I want to include the inline images, and I am under the impression that using Html.fromHtml(string, Html.ImageGetter, Html.TagHandler) will allow this to happen. Since Html.ImageGetter doesn't have an implementation, it's up to me to write one. However, since parsing URLs into Drawables requires network access, I can't do this on the main thread, so it must be an AsyncTask. I think. However, when you