html-agility-pack | 易学教程

How to fix ill-formed HTML with HTML Agility Pack?

Question: I have this ill-formed HTML with overlapping tags:

<p>word1<b>word2</p> <p>word3</b>word4</p>

The overlapping can be nested, too. How can I convert it into well-formed HTML with HTML Agility Pack (HAP)? I'm looking for this output:

<p>word1<b>word2</b></p> <p><b>word3</b>word4</p>

I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

Answer 1: It is in fact working as expected, but maybe not working as you expected. Anyway,
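To see what HAP actually does with the overlapping markup, it can help to load the fragment with the flags from the question and print both the parse errors and the serialized tree. The following is only a minimal sketch for inspection, not a fix; the variable names are illustrative, and no particular output is claimed because it depends on the HAP version.

```csharp
using System;
using HtmlAgilityPack;

// ElementsFlags is a static, process-wide table: this change affects every later parse.
HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap;

var doc = new HtmlDocument();
doc.LoadHtml("<p>word1<b>word2</p> <p>word3</b>word4</p>");

// List whatever the parser complained about while building the tree.
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine($"{error.Code} at line {error.Line}: {error.Reason}");
}

// Serialize the tree as HAP built it, to compare against the desired output.
string serialized = doc.DocumentNode.OuterHtml;
Console.WriteLine(serialized);
```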

HTML Agility Pack strip tags NOT IN whitelist

I'm trying to create a function that removes HTML tags and attributes which are not in a whitelist. I have the following HTML:

<b>first text </b> <b>second text here <a>some text here</a> <a>some text here</a> </b> <a>some twxt here</a>

I am using HTML Agility Pack, and the code I have so far is:

static List<string> WhiteNodeList = new List<string> { "b" };
static List<string> WhiteAttrList = new List<string> { };
static HtmlNode htmlNode;

public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
{
    // remove all attributes
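The excerpt is cut off before the function body, so the following is not the asker's code. It is one possible sketch of whitelist filtering, under the assumption that disallowed tags should be unwrapped (their inner text kept) rather than deleted with their content. The names allowedTags and allowedAttributes and the href attribute in the sample are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical whitelists: only <b> is allowed, and no attributes at all.
var allowedTags = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b" };
var allowedAttributes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

var doc = new HtmlDocument();
doc.LoadHtml("<b class='x'>first text </b><b>second text here <a href='#'>some text here</a></b><a>some text here</a>");

// Snapshot the elements first (deepest first), because the tree changes while we walk it.
var elements = doc.DocumentNode.Descendants()
    .Where(n => n.NodeType == HtmlNodeType.Element)
    .Reverse()
    .ToList();

foreach (HtmlNode node in elements)
{
    if (!allowedTags.Contains(node.Name))
    {
        // Remove the tag itself but keep its children in place.
        node.ParentNode.RemoveChild(node, true);
        continue;
    }

    // Copy the attribute list before removing entries from it.
    foreach (HtmlAttribute attribute in node.Attributes.ToList())
    {
        if (!allowedAttributes.Contains(attribute.Name))
        {
            attribute.Remove();
        }
    }
}

string output = doc.DocumentNode.OuterHtml;
Console.WriteLine(output);
```

If disallowed tags should disappear together with their content, passing false as the second argument to RemoveChild would be the corresponding variation.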

How to get img/src or a/hrefs using Html Agility Pack?

Question: I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I use C# in Visual Studio 2005. I don't speak English fluently, so I would sincerely appreciate it if someone could write some helpful code.

Answer 1: The first example on the home page does something very similar, but consider:

HtmlDocument doc = new HtmlDocument();
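The answer is truncated after its first line, so the following is an independent sketch rather than the answerer's code. It collects href values from anchors and src values from images using XPath, with an inline sample page as an assumption. Note that SelectNodes returns null, not an empty collection, when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><a href='page.html'>link</a><img src='pic.png'><a>no href</a></body></html>");

var hrefs = new List<string>();
var srcs = new List<string>();

// "//a[@href]" matches only anchors that actually carry an href attribute.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        hrefs.Add(link.GetAttributeValue("href", string.Empty));
    }
}

HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    foreach (HtmlNode image in images)
    {
        srcs.Add(image.GetAttributeValue("src", string.Empty));
    }
}

Console.WriteLine(string.Join(", ", hrefs));
Console.WriteLine(string.Join(", ", srcs));
```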

Image tag not closing with HTMLAgilityPack

Question: Using the HTMLAgilityPack to write out a new image node, it seems to remove the closing tag of an image, e.g. should be but when you check outer html, has .

string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);

This breaks XHTML.

Answer 1: Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:

doc.OptionWriteEmptyNodes = true;

Answer 2: There is an
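As a rough illustration of Answer 1, the sketch below sets OptionWriteEmptyNodes on the document before serializing. The sample markup and file name are assumptions made for the example.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();

// Ask the writer to emit empty elements such as <img> in self-closing form.
doc.OptionWriteEmptyNodes = true;

doc.LoadHtml("<div><img src='photo.png' height='10px' width='20px' /></div>");

string html = doc.DocumentNode.OuterHtml;
Console.WriteLine(html);
```

The option belongs to the HtmlDocument, so it has to be set on the document that owns the node being written out.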

How to get all input elements in a form with HtmlAgilityPack without getting a null reference error

Question: Example HTML:

<html><body>
<form id="form1"> <input name="foo1" value="bar1" /> </form>
<form id="form2"> <input name="foo2" value="bar2" /> </form>
</body></html>

Test code:

HtmlDocument doc = new HtmlDocument();
doc.Load(@"D:\test.html");
foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input"))
{
    Console.WriteLine(node.Attributes["value"].Value);
}

The statement doc.GetElementbyId("form2").SelectNodes(".//input") gives me
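The excerpt stops before any answer, so what follows is a hedged sketch of two defensive measures, not the accepted solution. First, SelectNodes returns null when nothing matches, so the result should be checked before iterating. Second, a commonly cited cause is HAP's special handling of form elements, which can leave the inputs outside the form node; removing the "form" entry from HtmlNode.ElementsFlags before parsing is an assumption about the HAP version in use and is harmless if the entry is absent.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Assumption: drop any special parsing rule for <form> so inputs become its children.
// ElementsFlags is static, so this affects all later parsing in the process.
HtmlNode.ElementsFlags.Remove("form");

var doc = new HtmlDocument();
doc.LoadHtml(
    "<html><body>" +
    "<form id=\"form1\"><input name=\"foo1\" value=\"bar1\" /></form>" +
    "<form id=\"form2\"><input name=\"foo2\" value=\"bar2\" /></form>" +
    "</body></html>");

var values = new List<string>();

// Both calls may yield null: the id may not exist, and the XPath may match nothing.
var form = doc.GetElementbyId("form2");
var inputs = form?.SelectNodes(".//input");
if (inputs != null)
{
    foreach (HtmlNode input in inputs)
    {
        // GetAttributeValue avoids a null reference when the attribute is missing.
        values.Add(input.GetAttributeValue("value", string.Empty));
    }
}

Console.WriteLine(string.Join(", ", values));
```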

How can I use HTML Agility Pack to retrieve all the images from a website?

Question: I just downloaded the HTMLAgilityPack and the documentation doesn't have any examples. I'm looking for a way to download all the images from a website. The address strings, not the physical image.

<img src="blabalbalbal.jpeg" />

I need to pull the source of each img tag. I just want to get a feel for the library and what it can offer. Everyone said this was the best tool for the job.

Edit:

public void GetAllImages() { WebClient x = new WebClient(); string source = x.DownloadString(@"http://www
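The asker's code is cut off, so this is a separate sketch of the parsing step only. It assumes the page source has already been downloaded into a string, and it resolves each src against a base address so that relative paths become full URLs. The base address http://example.com/gallery/ and the sample markup are placeholders.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Placeholder base address; in practice this would be the URL the page was fetched from.
var baseUri = new Uri("http://example.com/gallery/");
string source = "<html><body><img src=\"blabalbalbal.jpeg\" /><img alt=\"no src\" /><img src=\"/img/b.png\" /></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(source);

var imageUrls = new List<string>();

// Only images that have a src attribute are selected; the result is null if there are none.
var images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    foreach (HtmlNode image in images)
    {
        string src = image.GetAttributeValue("src", string.Empty);

        // Turn relative paths into absolute URLs; skip values that cannot be parsed.
        if (Uri.TryCreate(baseUri, src, out var absolute))
        {
            imageUrls.Add(absolute.AbsoluteUri);
        }
    }
}

foreach (string url in imageUrls)
{
    Console.WriteLine(url);
}
```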

htmlagilitypack and dynamic content issue

Question: I want to create a web scraper application using the WebBrowser control, htmlagilitypack and XPath. So far I have managed to create an XPath generator (I used the WebBrowser control for this), which works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or AJAX). I also found that the WebBrowser control (actually the IE browser) generates some extra tags like "tbody", while htmlagilitypack `htmlWeb.Load(webBrowser.DocumentStream);` doesn't see them.

HTML Agility pack - parsing tables

I want to use the HTML Agility Pack to parse tables from complex web pages, but I am somehow lost in the object model. I looked at the link example, but did not find any table data this way. Can I use XPath to get the tables? I am basically lost after having loaded the data as to how to get the tables. I have done this in Perl before and it was a bit clumsy, but worked (HTML::TableParser). I would also be happy if someone could just shed light on the right object order for the parsing.

Marc Gravell: How about something like this, using HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@
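Since the answer is truncated, here is a self-contained sketch in the same spirit rather than a reconstruction of it. It walks tables, then rows, then header or data cells with XPath, using a small inline table as assumed input. With nested tables, ".//tr" would also pick up the inner rows, so a real page may need a narrower expression.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<table><tr><th>Name</th><th>Qty</th></tr><tr><td>apple</td><td>3</td></tr></table>");

var rows = new List<List<string>>();

// Each SelectNodes call returns null when there is no match, so guard every level.
var tables = doc.DocumentNode.SelectNodes("//table");
if (tables != null)
{
    foreach (HtmlNode table in tables)
    {
        // ".//tr" finds rows whether or not the markup wraps them in <tbody>.
        var tableRows = table.SelectNodes(".//tr");
        if (tableRows == null) continue;

        foreach (HtmlNode row in tableRows)
        {
            // "th|td" takes header and data cells that are direct children of the row.
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;

            rows.Add(cells.Select(cell => cell.InnerText.Trim()).ToList());
        }
    }
}

foreach (var row in rows)
{
    Console.WriteLine(string.Join(" | ", row));
}
```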

HtmlAgilityPack Drops Option End Tags

Question: I am using HtmlAgilityPack. I create an HtmlDocument and call LoadHtml with the following string:

<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select>

This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children: two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this:

<select id="foo_Bar" name=
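The excerpt ends before any answer. A workaround often suggested for this behavior is to remove the "option" entry from the static HtmlNode.ElementsFlags table before parsing, so that option is treated like an ordinary element with an end tag. The sketch below assumes that this entry is what causes the behavior in the HAP version in use.

```csharp
using System;
using HtmlAgilityPack;

// Assumption: the default flags treat <option> as an element without an end tag.
// Removing the entry is process-wide and must happen before LoadHtml.
HtmlNode.ElementsFlags.Remove("option");

var doc = new HtmlDocument();
doc.LoadHtml("<select id=\"foo_Bar\" name=\"foo.Bar\"><option selected=\"selected\" value=\"1\">One</option><option value=\"2\">Two</option></select>");

// Report anything the parser still objects to.
foreach (HtmlParseError error in doc.ParseErrors)
{
    Console.WriteLine($"{error.Code}: {error.Reason}");
}

string html = doc.DocumentNode.OuterHtml;
Console.WriteLine(html);
```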

Selecting attribute values with html Agility Pack

Question: I'm trying to retrieve a specific image from an HTML document, using HTML Agility Pack and this XPath:

//div[@id='topslot']/a/img/@src

As far as I can see, it finds the src attribute, but it returns the img tag. Why is that? I would expect the InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img tag. Is there any documentation for Html Agility Pack?

Answer 1: Html Agility Pack does not support attribute selection.

Answer 2: You can directly
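Given Answer 1, one way around the limitation is to select the img element itself and read the attribute from the node. This is a minimal sketch with an assumed sample document; it is not necessarily what the truncated Answer 2 goes on to suggest.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div id='topslot'><a href='#'><img src='banner.png' alt='x'></a></div>");

// Select the element rather than the attribute; SelectSingleNode returns null if nothing matches.
var image = doc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");

// Read the attribute from the node, falling back to an empty string when it is absent.
string src = image != null ? image.GetAttributeValue("src", string.Empty) : string.Empty;

Console.WriteLine(src);
```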