html-parsing | 易学教程

How to extract all contents inside a div from HTML string in JavaScript

Question: I have an HTML string like this: var html = '<div id="parent_div"> <div id="child_div"> <ul> <li><img src="wow/img1.jpg" /><a href="http://wow.com">wow link</a></li> <li><img src="wow/img2.jpg" /><a href="http://wow.com">wow link</a></li> </ul> </div> </div>'; How do I extract all the contents inside <div id="parent_div">?
Answer 1: In jQuery you could just do $(html).html(); In PHP you could use something like Simple HTML DOM.
Answer 2: Take a look at DOM or the other PHP XML libs. http://se.php

Using DOMDocument to Parse HTML with JS code

Question: I take HTML in as a string and then parse it to change all href links to something else. This works; however, when the HTML page has JS script tags (i.e. <script>), they get removed. For example, this line: <script type="text/javascript" src="/js/jquery.js"></script> gets changed to: [removed][removed] However, I would like to keep everything in. This is my function: function parse_html_code($code, $code_id){ libxml_use_internal_errors(true); $xml = new DOMDocument(); $xml->loadHTML($code);

How can I selectively modify the src attributes of script tags in an HTML document using Perl?

Question: I need to write a regular expression in Perl that will prefix every src attribute with [perl]texthere[/perl], like this: <script src="[perl]texthere[/perl]/text"></script> Any help? Thanks!
Answer 1: Use a proper parser such as HTML::TokeParser::Simple: #!/usr/bin/env perl use strict; use warnings; use HTML::TokeParser::Simple; my $parser = HTML::TokeParser::Simple->new(handle => \*DATA); while (my $token = $parser->get_token('script')) { if ($token->is_tag('script') and defined(my $src = $token->get_attr(

Get the value of a html input element as a php string

Question: I have an HTML file loaded as a string in PHP, and I need to get the values of the input elements in the HTML string. Can someone help me build a function that takes the name of an input element and returns its value? This is an example of the function I would like to write: function getVal($name){ $htmlStr = "<form action = \"action.php\"><input type=\"hidden\" name=\"command\" value=\"123456\"> <input type=\"hidden\" name=\"quantity\" value=\"1\"> <input type=\"hidden\" name=\"user_mode\"
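The lookup the question describes (input name in, value out) can be sketched with a real HTML parser instead of string matching. The example below is a minimal sketch in Python rather than PHP, using only the standard library; the names `InputValueParser` and `get_val` are hypothetical, and the sample markup is a shortened version of the form in the question.

```python
from html.parser import HTMLParser


class InputValueParser(HTMLParser):
    """Collect name -> value for every <input> element in the document."""

    def __init__(self):
        super().__init__()
        self.values = {}  # first value seen for each input name

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Keep the first occurrence if the same name appears twice.
        if name is not None and name not in self.values:
            self.values[name] = attributes.get("value")


def get_val(html, name):
    """Return the value of the input called `name`, or None if absent."""
    parser = InputValueParser()
    parser.feed(html)
    parser.close()
    return parser.values.get(name)


sample = (
    '<form action = "action.php">'
    '<input type="hidden" name="command" value="123456"> '
    '<input type="hidden" name="quantity" value="1">'
    "</form>"
)
print(get_val(sample, "command"))
```

In PHP the same approach would probably go through DOMDocument and a lookup on the name attribute; the sketch only illustrates the parse-then-look-up structure.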

Groovy XmlSlurper get value of the node without children

Question: I'm parsing HTML and trying to get the value of a parent node itself, without the values of its child nodes. HTML example: <html> <body> <div> <a href="http://intro.com">extra stuff</a> Text I would like to get. <a href="http://example.com">link to example</a> </div> </body> </html> Code: def tagsoupParser = new org.ccil.cowan.tagsoup.Parser() def slurper = new XmlSlurper(tagsoupParser) def htmlParsed = slurper.parseText(stringToParse) println htmlParsed.body.div[0] However, the above code returns: extra

Extract text between two <hr> tags in CSS-less HTML

Question: Using Jsoup, what would be an optimal approach to extract text whose pattern is known ([number]%%[number]) but which resides in an HTML page that uses neither CSS nor divs, spans, classes, or other identifiers of any kind (yup, an old HTML page over which I have no control)? The only thing that consistently identifies that text segment (and is guaranteed to remain so) is that its HTML always looks like this (within a larger body of HTML): <hr> 2%%17 <hr> (The numbers 2 and 17 are
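One way to approach the problem described above is to collect the text that sits between consecutive <hr> tags and then keep only the segments that match the known [number]%%[number] pattern. The following is a minimal sketch in Python's standard library rather than Jsoup; `HrSegmentParser` and `extract_pairs` are hypothetical names, and the code assumes each target segment contains nothing but the pattern and surrounding whitespace.

```python
import re
from html.parser import HTMLParser


class HrSegmentParser(HTMLParser):
    """Collect the text found between consecutive <hr> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []   # finished segments, one per <hr> ... <hr> pair
        self._buffer = None  # stays None until the first <hr> is seen

    def handle_starttag(self, tag, attrs):
        if tag == "hr":
            # A second (or later) <hr> closes the segment opened by the previous one.
            if self._buffer is not None:
                self.segments.append("".join(self._buffer).strip())
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Assumed shape of the target text: digits, "%%", digits, nothing else.
PATTERN = re.compile(r"^(\d+)%%(\d+)$")


def extract_pairs(html):
    """Return (left, right) integer pairs for every matching <hr> segment."""
    parser = HrSegmentParser()
    parser.feed(html)
    parser.close()
    pairs = []
    for segment in parser.segments:
        match = PATTERN.match(segment)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


print(extract_pairs("<p>intro</p><hr> 2%%17 <hr><p>more</p>"))
```

With Jsoup the equivalent would likely involve selecting the hr elements and inspecting the text nodes that follow them; that part is not shown here.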

How to get string from HTML with regex?

Question: I'm trying to parse a block from an HTML page, so I tried to preg_match this block with PHP: if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t)) but it doesn't work. </div> blablabla blablabla blablabla <div class="adsdiv"> I want to grab only the "blablabla blablabla" words. Any help?
Answer 1: Regex isn't the right tool for this. Here is how to do it with DOM: $html = <<< HTML <div class="parent"> <div> <p>previous div<p> </div> blablabla blablabla blablabla <div class="adsdiv"> <p>other content</p> </div> <

How to determine the language of a website

Question: I have the URL of a website and need to find out which language the website uses (whether it's Spanish, French, Italian, etc.). The site's top-level domain is .com, which doesn't help at all, so I cannot simply check whether the string contains '.de', '.fr', or any other country code. I tried to get the lang attribute of the html tag, but many websites don't have it. I also found here that I can check the meta tag, which would look like this: <meta name="language" content=
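The two checks mentioned in the question (the lang attribute on the html tag, then a language meta tag) can be combined into one fallback chain. Below is a minimal Python sketch using only the standard library; `LanguageHintParser` and `declared_language` are hypothetical names, and treating `http-equiv="content-language"` as an additional hint is an assumption not taken from the question. It only reports what a page declares, so a page with neither hint still yields nothing.

```python
from html.parser import HTMLParser


class LanguageHintParser(HTMLParser):
    """Record the language a page declares about itself, if any."""

    def __init__(self):
        super().__init__()
        self.html_lang = None  # from <html lang="...">
        self.meta_lang = None  # from a language-related <meta> tag

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attributes.get("lang") or None
        elif tag == "meta" and self.meta_lang is None:
            name = (attributes.get("name") or "").lower()
            equiv = (attributes.get("http-equiv") or "").lower()
            # Assumption: either meta form counts as a language hint.
            if name == "language" or equiv == "content-language":
                self.meta_lang = attributes.get("content") or None


def declared_language(html):
    """Prefer the html lang attribute, fall back to the meta tag, else None."""
    parser = LanguageHintParser()
    parser.feed(html)
    parser.close()
    return parser.html_lang or parser.meta_lang


print(declared_language('<html lang="de"><head></head><body></body></html>'))
```

When both hints are missing, the remaining option is to guess from the page text itself, which needs a language-detection library and is outside this sketch.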

Find All text within 1 level in HTML using Beautiful Soup - Python

Question: I need to use Beautiful Soup to accomplish the following. Example HTML: <div id="div1"> Text1 <div id="div2"> Text2 <div id="div3"> Text3 </div> </div> </div> I need to search over this and get back, as separate items in a list: Text1 Text2 Text3 I tried doing a findAll('div'), but it repeated the same text multiple times, i.e. it returned Text1 Text2 Text3 Text2 Text3 Text3
Answer 1: Well, your problem is that .text also includes text from all the child nodes. You'll have to manually get
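The repetition described in the question disappears once each piece of text is attributed to a single div instead of to every ancestor. The sketch below shows that idea with Python's standard-library parser rather than Beautiful Soup, so it is an illustration of the principle and not the answer's own code; `DirectTextParser` and `direct_div_texts` are hypothetical names. Text is credited to the innermost div that is open when it appears.

```python
from html.parser import HTMLParser


class DirectTextParser(HTMLParser):
    """Credit each run of text to the innermost open <div> only."""

    def __init__(self):
        super().__init__()
        self.texts = []   # one entry per <div>, in document order
        self._stack = []  # indexes into self.texts for the currently open <div>s

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.texts.append("")
            self._stack.append(len(self.texts) - 1)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            self.texts[self._stack[-1]] += data


def direct_div_texts(html):
    """Return the own text of every <div>, stripped, without descendants' text."""
    parser = DirectTextParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


sample = (
    '<div id="div1"> Text1 <div id="div2"> Text2 '
    '<div id="div3"> Text3 </div> </div> </div>'
)
print(direct_div_texts(sample))
```

In Beautiful Soup the same effect is usually reached by looking only at a tag's immediate string children instead of its full text.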

Extract absolute links from a page using HTMLParser

Question: I'm using the following snippet to extract all the links on a page using HTMLParser. I get quite a few relative URLs. How can I convert these to absolute URLs for a domain, e.g. www.example.com? import urllib, htmllib, formatter class LinksExtractor(htmllib.HTMLParser): def __init__(self, formatter): htmllib.HTMLParser.__init__(self, formatter) self.links = [] def start_a(self, attrs): if len(attrs) > 0 : for attr in attrs : if attr[0] == "href": self.links.append
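Relative links can be resolved against the URL of the page they came from with `urljoin`. The snippet in the question targets Python 2's `htmllib`, which no longer exists in Python 3, so the following minimal sketch assumes Python 3 and uses `html.parser` and `urllib.parse` instead; `LinkExtractor`, `extract_links`, and the base URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect every <a href> on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL the page was fetched from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<a href="/about">About</a>'
    '<a href="page.html">Page</a>'
    '<a href="http://other.org/x">Other</a>'
)
for link in extract_links(page, "http://www.example.com/docs/index.html"):
    print(link)
```

Passing the full page URL rather than just the domain matters: a link such as `page.html` is relative to the page's directory, not to the site root.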