html-parsing

Parse Website for URLs

强颜欢笑 posted on 2019-12-17 17:03:54
Question: Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr I have the following code: <?PHP $url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr"; $input = @file_get_contents($url) or die("Could not access file: $url"); $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>"; if(preg_match_all("/$regexp/siU", $input, $matches)) { // $matches[2] =
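For comparison with the regex approach in the question, here is a minimal sketch of collecting link targets with an HTML parser. It is written in Python rather than PHP and uses only the standard library. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration, and the page is assumed to have been downloaded into a string already.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input, not taken from the site in the question.
sample = '<p><a href="/a.html">A</a> <A HREF=\'http://example.com/b\'>B</a></p>'
links = extract_links(sample, "http://example.org/dir/")
print(links)
```

Because the parser handles quoting and letter case itself, the attribute-matching part of the regular expression is no longer needed.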

A JavaScript parser for DOM

耗尽温柔 posted on 2019-12-17 16:23:06
Question: We have a special requirement in a project where we have to parse a string of HTML (from an AJAX response) client side, via JavaScript only. That's right, no parsing in PHP or Java! I've been going through StackOverflow this entire week and have not yet found an acceptable solution. Some more details on the requirements: We can use any library (preferably dojo and/or jQuery) or go native! We need to parse an entire HTML document that we receive as a string, including the <head> and <body>.

Using XPath Contains against HTML in Java

梦想的初衷 posted on 2019-12-17 15:34:15
Question: I'm scraping values from HTML pages using XPath inside a Java program to get to a specific tag, occasionally using regular expressions to clean up the data I receive. After some research, I landed on HTML Cleaner (http://htmlcleaner.sourceforge.net/) as the most reliable way to parse raw HTML into a good XML format. HTML Cleaner, however, only supports XPath 1.0 and I find myself needing functions like 'contains'. For instance, in this piece of XML: <div> <td id='1234 foo 5678'>Hello<
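As a side note, `contains()` is defined in XPath 1.0 itself, so a full XPath 1.0 engine accepts an expression such as `//td[contains(@id, 'foo')]`; whether a particular library's XPath subset does is a separate matter. When the evaluator at hand lacks the function, the same substring test can be done in ordinary code after parsing. The sketch below shows this in Python with the standard library. The helper name and the sample document are made up, the latter only modelled on the fragment in the question.

```python
import xml.etree.ElementTree as ET


def elements_with_attr_containing(root, tag, attr, needle):
    """Return every <tag> element under root whose attribute contains needle.

    This mirrors the XPath 1.0 expression //tag[contains(@attr, 'needle')].
    """
    return [el for el in root.iter(tag) if needle in el.get(attr, "")]


# Hypothetical, well-formed sample; the fragment in the question is truncated.
doc = ET.fromstring(
    "<div><td id='1234 foo 5678'>Hello</td><td id='9'>Bye</td></div>"
)
hits = elements_with_attr_containing(doc, "td", "id", "foo")
print([el.text for el in hits])
```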

Wordwrap / Cut Text in HTML string

人走茶凉 posted on 2019-12-17 14:54:01
Question: Here is what I want to do: I have a string containing HTML tags and I want to cut it using the wordwrap function, excluding the HTML tags. I'm stuck: public function textWrap($string, $width) { $dom = new DOMDocument(); $dom->loadHTML($string); foreach ($dom->getElementsByTagName('*') as $elem) { foreach ($elem->childNodes as $node) { if ($node->nodeType === XML_TEXT_NODE) { $text = trim($node->nodeValue); $length = mb_strlen($text); $width -= $length; if($width <= 0) { // Here, I would like to

Convert html to plain text in VBA

只愿长相守 posted on 2019-12-17 12:37:23
Question: I have an Excel sheet with cells containing HTML. How can I batch convert them to plain text? At the moment there are so many useless tags and styles. I want to write it from scratch, but it will be far easier if I can get the plain text out. I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA then maybe you can suggest how I might pass the cells' data to a website and retrieve the data back. Answer 1: Set a reference to "Microsoft HTML object library".
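The answer above is cut off before its VBA code, so the following is not that answer. It is a separate minimal sketch of the same idea (let an HTML parser strip the markup and keep only the text), written in Python with the standard library. `TextExtractor` and `html_to_text` are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what counts as "useless" here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text nodes, dropping tags and the contents of script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical cell content.
cell = '<p style="color:red">Hello <b>world</b> &amp; all</p><style>p{}</style>'
print(html_to_text(cell))
```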

Convert html to plain text in VBA

被刻印的时光 ゝ posted on 2019-12-17 12:36:21
Question: I have an Excel sheet with cells containing HTML. How can I batch convert them to plain text? At the moment there are so many useless tags and styles. I want to write it from scratch, but it will be far easier if I can get the plain text out. I can write a script to convert HTML to plain text in PHP, so if you can't think of a solution in VBA then maybe you can suggest how I might pass the cells' data to a website and retrieve the data back. Answer 1: Set a reference to "Microsoft HTML object library".

How to extract a JSON object that was defined in a HTML page javascript block using Python?

帅比萌擦擦* posted on 2019-12-17 10:49:13
Question: I am downloading HTML pages that have data defined in them in the following way: ... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ... I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.) Thanks. Edit: Would it be possible and more correct to do this with a Python
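One possible approach, sketched here under the assumption that the assigned value is strict JSON (double-quoted keys, no trailing commas, no JavaScript expressions): locate the assignment with a small regular expression, then let `json.JSONDecoder.raw_decode` parse exactly one JSON value from that position and ignore whatever follows. The function name `extract_js_object` is made up for this example.

```python
import json
import re


def extract_js_object(html, variable):
    """Return the JSON value assigned to `variable` in the page, or None.

    Raises json.JSONDecodeError if the text after '=' is not valid JSON.
    """
    # Find "<variable> =" and skip the whitespace after the equals sign,
    # because raw_decode does not skip leading whitespace itself.
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # tolerates trailing text such as ";</script>".
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


page = (
    '<script type= "text/javascript"> '
    'window.blog.data = {"activity":{"type":"read"}}; </script>'
)
data = extract_js_object(page, "window.blog.data")
print(data)
```

Since the decoder stops at the end of the object by itself, there is no need to guess where the closing brace is, which is the fragile part of a regex-only solution.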

How do HTML parsers work if they're not using regexp?

天涯浪子 posted on 2019-12-17 03:27:06
Question: I see questions every day asking how to parse or extract something from some HTML string, and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted). This is rather confusing for me; I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse? One particular argument for using a regular
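To make the question concrete, here is a deliberately tiny sketch of the two stages most parsers share: a tokenizer written as a character-by-character state machine, and a tree builder that tracks nesting with a stack of open elements. It is a toy under strong assumptions (no attributes kept, no comments, no entities, no void elements such as `<br>`, none of the error recovery the HTML specification requires), and the function names are made up. The stack is the part that a single regular expression cannot express for arbitrarily deep nesting.

```python
def tokenize(html):
    """Split markup into ('text', s), ('open', name), ('close', name) tokens."""
    tokens = []
    state = "text"  # either "text" or "tag"
    buffer = []
    for ch in html:
        if state == "text":
            if ch == "<":
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                buffer = []
                state = "tag"
            else:
                buffer.append(ch)
        else:  # inside <...>
            if ch == ">":
                raw = "".join(buffer).strip()
                buffer = []
                state = "text"
                if raw.startswith("/"):
                    tokens.append(("close", raw[1:].strip().lower()))
                else:
                    words = raw.split()
                    tokens.append(("open", words[0].lower() if words else ""))
            else:
                buffer.append(ch)
    if state == "text" and buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens


def build_tree(tokens):
    """Nest tokens into (name, children) tuples using a stack of open elements."""
    root = ("#root", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1][1].append(value)
        elif kind == "open":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            # Pop back to the nearest matching open element, if there is one;
            # an unmatched close tag is silently ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i][0] == value:
                    del stack[i:]
                    break
    return root


tree = build_tree(tokenize("<div><p>a<b>b</b></p>c</div>"))
print(tree)
```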

Regex select all text between tags

本小妞迷上赌 posted on 2019-12-17 02:58:30
Question: What is the best way to select all the text between 2 tags, e.g. the text between all the 'pre' tags on the page? Answer 1: You can use "<pre>(.*?)</pre>" (replacing pre with whatever text you want) and extract the first group (for more specific instructions, specify a language), but this relies on the simplistic assumption that you have very simple and valid HTML. As other commenters have suggested, if you're doing something complex, use an HTML parser. Answer 2: A tag can be completed on another line. This is
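A minimal Python sketch of the pattern from Answer 1, with two adjustments that are assumptions on top of it: `re.DOTALL` so that `.` also matches newlines (the multi-line case Answer 2 starts to raise), and `[^>]*` so that an opening tag with attributes still matches. The same caveat applies as in the answer: this is only suitable for simple, well-formed markup without nested `<pre>` elements.

```python
import re

# Lazy match between an opening <pre ...> and the next </pre>.
PRE_RE = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)


def text_between_pre(html):
    """Return the contents of every <pre>...</pre> block, in document order."""
    return PRE_RE.findall(html)


# Hypothetical sample input.
sample = "<pre>one</pre><p>skip</p><PRE class='x'>two\nlines</PRE>"
blocks = text_between_pre(sample)
print(blocks)
```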

Android HTML ImageGetter as AsyncTask

て烟熏妆下的殇ゞ posted on 2019-12-17 01:43:08
Question: Okay, I'm losing my mind over this one. I have a method in my program which parses HTML. I want to include the inline images, and I am under the impression that using Html.fromHtml(string, Html.ImageGetter, Html.TagHandler) will allow this to happen. Since Html.ImageGetter doesn't have an implementation, it's up to me to write one. However, since parsing URLs into Drawables requires network access, I can't do this on the main thread, so it must be an AsyncTask. I think. However, when you