html-parsing | 易学教程

Retrieve the link value from <a href> tag using PHP

Question: I need to extract the link value stored in an <a href> tag using PHP code. <a href="http://stackoverflow.com/questions/ask"></a> From the above code I want to extract the link http://stackoverflow.com/questions/ask using PHP code.

Answer 1: There are a variety of options. If you know the href will always be the one and only attribute on the a tag, you can find the positions of the first and last double quotes using strpos/stripos and use substr to pull out the href. Alternatively, even
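The quote-position approach described in Answer 1 can be sketched as follows. This is a Python illustration rather than PHP: `str.find` and `str.rfind` stand in for PHP's `strpos` and `strrpos`, and the helper name `href_between_quotes` is hypothetical. It only holds under the answer's own assumption that href is the sole attribute on the tag.

```python
def href_between_quotes(tag: str) -> str:
    """Return the text between the first and last double quote in `tag`.

    Assumes href is the only attribute on the tag, as the answer stipulates.
    """
    first = tag.find('"')   # position of the opening quote
    last = tag.rfind('"')   # position of the closing quote
    if first == -1 or last <= first:
        raise ValueError("no double-quoted attribute value found")
    return tag[first + 1:last]


link = href_between_quotes('<a href="http://stackoverflow.com/questions/ask"></a>')
print(link)
```

If the tag may carry other attributes, this slicing returns the wrong span, which is why a real HTML parser is usually the safer choice.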

Repairing invalid HTML with Nokogiri (removing invalid tags)

Question: I'm trying to tidy some retrieved HTML using the tidy-ext gem. However, it fails when the HTML is badly broken, so I'm trying to repair the HTML using Nokogiri first: repaired_html = Nokogiri::HTML.parse(a.raw_html).to_html It seems to do a nice job, but I recently encountered a sample where people had inserted FBML markup such as <fb:like> into the HTML document, which Nokogiri preserves even though it is invalid. Tidy then says Error: <fb:like> is not recognized! which is understandable.

Parse html using Perl

Question: I have the following HTML: <div> <strong>Date: </strong> 19 July 2011 </div> I have been using HTML::TreeBuilder to parse out particular parts of HTML that use either tags or classes; however, the HTML above is giving me difficulty when I try to extract only the date. For instance, I tried: for ( $tree->look_down( '_tag' => 'div')) { my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text; But that seems to conflict with an earlier use of <strong>. I am looking to parse out

Multiclass element selection clarification [duplicate]

Question: This question already has an answer here: Jsoup div[class=] syntax works whereas div.class syntax doesn't - Why? (1 answer) Closed 5 years ago. Assuming several multiclass divs as demonstrated in the following HTML: <div class="class_one class_two class_three classfour classfive classsix"> <div class="class_one class_two class_three classfour classfive"> <div class="class_one class_two class_three classfour classsix"> Is there a single Jsoup select expression that will select all 3 of them?
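The excerpt does not include an answer. In standard CSS selector semantics, chaining class selectors (for example div.class_one.class_two.class_three.classfour) matches any div whose class list contains all of the named classes, whatever else it carries; whether that resolved the asker's Jsoup problem is not shown here. The Python sketch below only illustrates that "contains all of these classes" test using the standard-library `html.parser`; the class name `DivClassMatcher` and the `REQUIRED` set are hypothetical.

```python
from html.parser import HTMLParser

# Classes shared by all three divs in the question.
REQUIRED = {"class_one", "class_two", "class_three", "classfour"}


class DivClassMatcher(HTMLParser):
    """Record every <div> whose class list contains all required classes."""

    def __init__(self, required):
        super().__init__()
        self.required = set(required)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute is a whitespace-separated list of names.
        classes = set((dict(attrs).get("class") or "").split())
        if self.required <= classes:  # subset test: all required present
            self.matches.append(sorted(classes))


html = (
    '<div class="class_one class_two class_three classfour classfive classsix">'
    '<div class="class_one class_two class_three classfour classfive">'
    '<div class="class_one class_two class_three classfour classsix">'
)
matcher = DivClassMatcher(REQUIRED)
matcher.feed(html)
print(len(matcher.matches))
```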

Using BeautifulSoup on very large HTML file - memory error?

Question: I'm learning Python by working on a project: a Facebook message analyzer. I downloaded my data, which includes a messages.htm file of all my messages. I'm trying to write a program to parse this file and output data (number of messages, most common words, etc.). However, my messages.htm file is 270MB. When creating a BeautifulSoup object in the shell for testing, any other file (all < 1MB) works just fine. But I can't create a bs object of messages.htm. Here's the error: >>> mf = open('messages
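The excerpt stops before any answer is shown. One general way to avoid holding a full parse tree in memory is an event-driven parser fed in chunks. The sketch below swaps BeautifulSoup for Python's standard-library `html.parser` and counts start tags carrying a given class. The names `MessageCounter` and `count_messages`, the class name "message", and the chunk size are all assumptions, since the real structure of messages.htm is not shown in the source.

```python
import io
from html.parser import HTMLParser


class MessageCounter(HTMLParser):
    """Count start tags that carry a given class, without building a tree."""

    def __init__(self, css_class="message"):  # "message" is a hypothetical class name
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_messages(stream, chunk_size=64 * 1024):
    """Feed the parser one chunk at a time so the file is never fully loaded."""
    parser = MessageCounter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # tags split across chunks are buffered by the parser
    parser.close()
    return parser.count


# Small in-memory stand-in for a large file; a real call might pass
# an open text-mode file object instead.
sample = io.StringIO('<div class="message">hi</div><div class="message">yo</div>')
print(count_messages(sample, chunk_size=8))
```

Counting tags is used here because tag events arrive whole even when a chunk boundary falls inside a tag; text nodes can be delivered in pieces, so word counting would need extra buffering.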

BeautifulSoup doesn't find correctly parsed elements

Question: I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing. The HTML comes from this page: http://www.wvdnr.gov/ It contains multiple errors, like multiple <html></html>, <title> outside the <head>, etc. However, html5lib usually works well even in these cases. In fact, when I do: soup = BeautifulSoup(document, "html5lib") and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88 which contains a lot of <a

How do I determine if there are two or one numbers at the start of my string?

Question: How can I determine what number (with an arbitrary number of digits) is at the start of a string? Some possible strings: 1123|http://example.com 2|daas These should return 1123 and 2.

Answer 1: You can use LINQ: string s = "35|..."; int result = int.Parse(new string(s.TakeWhile(char.IsDigit).ToArray())); or (if the number is always followed by a |) good ol' string manipulation: string s = "35|..."; int result = int.Parse(s.Substring(0, s.IndexOf('|')));

Answer 2: Use a regular expression: using System
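The two techniques in Answer 1 translate directly to other languages. As a rough Python sketch (the function names are hypothetical): `itertools.takewhile` mirrors LINQ's TakeWhile, and slicing up to `str.index("|")` mirrors the Substring/IndexOf version.

```python
import string
from itertools import takewhile


def leading_number(s: str) -> int:
    """Parse the run of ASCII digits at the start of `s`."""
    digits = "".join(takewhile(lambda c: c in string.digits, s))
    if not digits:
        raise ValueError("string does not start with a digit")
    return int(digits)


def number_before_pipe(s: str) -> int:
    """Parse everything before the first '|'; assumes a '|' always follows the number."""
    return int(s[:s.index("|")])


for text in ("1123|http://example.com", "2|daas"):
    print(leading_number(text), number_before_pipe(text))
```

The first function makes no assumption about what follows the digits; the second raises ValueError when no | is present.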

Parsing HTML using Xpath with Javascript

Question: In .NET there is a lovely library that allows me to easily parse an external HTML page using XPath queries (HTML Agility Project). The problem is that I have to do this client-side, so only JavaScript. Is there any way to do that?

Answer 1: jQuery also supports XPath selectors as well as CSS; you can get more information from the link below. http://docs.jquery.com/DOM/Traversing/Selectors

Answer 2: You can try https://github.com/andrejpavlovic/xpathjs Actually there are a lot of them and there is an window

How to modify an html tree in python?

Question: Suppose a variable holds a fragment of HTML code: <p> <span class="code"> string 1 </span> <span class="code"> string 2 </span> <span class="code"> string 3 </span> </p> <p> <span class="any"> Some text </span> </p> I need to modify the contents of all <span> tags with the class code by passing their content through some function, such as foo, which returns the modified contents of the <span> tag. Ultimately, I should get a new piece of HTML document like this: <p> <span class="code">

Get div content by id

Question: I have a string that holds an entire HTML document. I would like to get all the content inside a div with a specific id. For example: <div id="myId" class = "myClass"> <div class = "myClass">hello</div> </div> I need the content between the tag with id="myId" and its closing tag. Is there any way to achieve this? The output should be the second line.

Answer 1: The clean and correct way would be via an HTML parser, like HtmlAgilityPack: string stringThatKeepsYourHtml = "<div id=...."; HtmlDocument doc = new
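The answer's HtmlAgilityPack code is cut off in this excerpt. To illustrate the same parser-based idea, here is a minimal Python sketch using the standard-library `html.parser` as a stand-in; it is not the answer's C# code. The names `DivById` and `inner_html` are hypothetical. It tracks div nesting depth so that the inner closing tag is not mistaken for the end of the target div. One caveat: with the parser's default settings, character references in text are decoded, so the output is not guaranteed to be byte-for-byte identical to the input.

```python
from html.parser import HTMLParser


class DivById(HTMLParser):
    """Collect the markup nested inside the first <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth of <div> tags, counting the target
        self.done = False   # set once the target's closing tag has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Inside the target: copy the tag text as written in the source.
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # this closes the target div itself
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def inner_html(html, target_id):
    parser = DivById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


doc = '<div id="myId" class = "myClass"> <div class = "myClass">hello</div> </div>'
print(inner_html(doc, "myId").strip())
```

A tree-building library would make this shorter; the event-driven version is shown only because it needs nothing beyond the standard library.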