html-agility-pack

How to fix ill-formed HTML with HTML Agility Pack?

跟風遠走 submitted on 2019-11-26 16:36:42

Question: I have this ill-formed HTML with overlapping tags: <p>word1<b>word2</p> <p>word3</b>word4</p> The overlapping can be nested, too. How can I convert it into well-formed HTML with HTML Agility Pack (HAP)? I'm looking for this output: <p>word1<b>word2</b></p> <p><b>word3</b>word4</p> I tried HtmlNode.ElementsFlags["b"] = HtmlElementFlag.Closed | HtmlElementFlag.CanOverlap, but it does not work as expected.

Answer 1: It is in fact working as expected, but maybe not working as you expected. Anyway,

HTML Agility Pack strip tags NOT IN whitelist

风流意气都作罢 submitted on 2019-11-26 15:54:21

Question: I'm trying to create a function that removes HTML tags and attributes which are not in a whitelist. I have the following HTML: <b>first text </b> <b>second text here <a>some text here</a> <a>some text here</a> </b> <a>some twxt here</a> I am using HTML Agility Pack, and the code I have so far is: static List<string> WhiteNodeList = new List<string> { "b" }; static List<string> WhiteAttrList = new List<string> { }; static HtmlNode htmlNode; public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList) { // remove all attributes

How to get img/src or a/hrefs using Html Agility Pack?

你。 submitted on 2019-11-26 14:37:01

Question: I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. Although I have looked through help documents on many websites, I can't solve the problem. In addition, I use C# in Visual Studio 2005. I can't speak English fluently, so I would sincerely thank anyone who can write some helpful code.

Answer 1: The first example on the home page does something very similar, but consider: HtmlDocument doc = new HtmlDocument();
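The answer's code is cut off above, so the following is only a rough sketch of the general technique, not the answerer's code: select the elements with XPath, then read the attribute from each node. `AttributeReader` and `GetAttributeValues` are hypothetical names, and the code assumes the HtmlAgilityPack assembly is referenced. It avoids `var` to stay close to the C# 2.0 that ships with Visual Studio 2005.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AttributeReader
{
    // Returns the value of attributeName for every node matched by xpath.
    public static List<string> GetAttributeValues(string html, string xpath, string attributeName)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> values = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return values;
        }

        foreach (HtmlNode node in nodes)
        {
            values.Add(node.GetAttributeValue(attributeName, string.Empty));
        }
        return values;
    }
}
```

Under these assumptions, `AttributeReader.GetAttributeValues(html, "//a[@href]", "href")` would collect link targets and `AttributeReader.GetAttributeValues(html, "//img[@src]", "src")` would collect image sources.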

Image tag not closing with HTMLAgilityPack

落爺英雄遲暮 submitted on 2019-11-26 14:24:10

Question: When I use the HTMLAgilityPack to write out a new image node, it seems to remove the closing slash of the image tag, e.g. the tag should be written as <img /> but when you check the outer HTML, it has <img>. string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />"; HtmlNode newNode = HtmlNode.Create(strIMG); This breaks XHTML.

Answer 1: Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this: doc.OptionWriteEmptyNodes = true;

Answer 2: There is an
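A small sketch of how the `OptionWriteEmptyNodes` setting from Answer 1 might be exercised. `EmptyNodeWriter` and `Serialize` are hypothetical names. The sketch serializes through `HtmlDocument.Save` because, as far as I know, `OuterHtml` on an unmodified node can return the original source text instead of re-serializing it; treat that detail as an assumption to check against your HAP version.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class EmptyNodeWriter
{
    // Parses html and writes it back out, optionally self-closing empty elements such as <img>.
    public static string Serialize(string html, bool writeEmptyNodes)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = writeEmptyNodes;
        doc.LoadHtml(html);

        // Save re-serializes the node tree, so the option is applied to every empty element.
        StringWriter writer = new StringWriter();
        doc.Save(writer);
        return writer.ToString();
    }
}
```

With `writeEmptyNodes` set to true, an `<img>` element should be written with a trailing ` />`; with false, it should end in a plain `>`.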

How to get all input elements in a form with HtmlAgilityPack without getting a null reference error

久未见 submitted on 2019-11-26 14:12:43

Question: Example HTML: <html><body> <form id="form1"> <input name="foo1" value="bar1" /> <!-- Other elements --> </form> <form id="form2"> <input name="foo2" value="bar2" /> <!-- Other elements --> </form> </body></html> Test code: HtmlDocument doc = new HtmlDocument(); doc.Load(@"D:\test.html"); foreach (HtmlNode node in doc.GetElementbyId("form2").SelectNodes(".//input")) { Console.WriteLine(node.Attributes["value"].Value); } The statement doc.GetElementbyId("form2").SelectNodes(".//input") gives me
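The answer is not included above, so what follows is an assumption about the likely cause and not the accepted fix. In HAP versions of that era, `form` was registered in `HtmlNode.ElementsFlags` as an element that cannot hold children, so the inputs were parsed as siblings of the form. `SelectNodes` then returned null, and `foreach` dereferenced it. A defensive sketch removes the flag before parsing and null-checks each lookup. `FormInputReader` and `GetInputValues` are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class FormInputReader
{
    // Returns the value attribute of every <input> inside the form with the given id.
    public static List<string> GetInputValues(string html, string formId)
    {
        // Allow <form> to contain child nodes. This changes static, process-wide
        // parser state and is a no-op if the flag is not registered.
        HtmlNode.ElementsFlags.Remove("form");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> values = new List<string>();

        HtmlNode form = doc.GetElementbyId(formId);
        if (form == null)
        {
            return values; // no element with that id
        }

        // SelectNodes returns null when nothing matches, so check before iterating.
        HtmlNodeCollection inputs = form.SelectNodes(".//input");
        if (inputs == null)
        {
            return values;
        }

        foreach (HtmlNode input in inputs)
        {
            values.Add(input.GetAttributeValue("value", string.Empty));
        }
        return values;
    }
}
```

Because `ElementsFlags` is static, removing the entry affects every document parsed afterwards in the same process, which may or may not be what you want.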

How can I use HTML Agility Pack to retrieve all the images from a website?

时间秒杀一切 submitted on 2019-11-26 14:06:58

Question: I just downloaded the HTMLAgilityPack and the documentation doesn't have any examples. I'm looking for a way to download all the images from a website: the address strings, not the physical images. <img src="blabalbalbal.jpeg" /> I need to pull the source of each img tag. I just want to get a feel for the library and what it can offer. Everyone said this was the best tool for the job. Edit: public void GetAllImages() { WebClient x = new WebClient(); string source = x.DownloadString(@"http://www

htmlagilitypack and dynamic content issue

两盒软妹~` submitted on 2019-11-26 13:08:23

Question: I want to create a web scraper application, and I want to do it with the WebBrowser control, HtmlAgilityPack and XPath. So far I have managed to create an XPath generator (I used the WebBrowser control for this), which works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or AJAX). I also found out that the WebBrowser control (actually the IE browser) generates some extra tags like "tbody", while HtmlAgilityPack `htmlWeb.Load(webBrowser.DocumentStream);` doesn't see them.

HTML Agility pack - parsing tables

自闭症网瘾萝莉.ら submitted on 2019-11-26 12:42:13

Question: I want to use the HTML Agility Pack to parse tables from complex web pages, but I am somehow lost in the object model. I looked at the link example, but did not find any table data this way. Can I use XPath to get the tables? After loading the data, I am basically lost as to how to get the tables. I have done this in Perl before (HTML::TableParser) and it was a bit clumsy, but it worked. I would also be happy if someone could just shed some light on the right object order for the parsing.

Answer (Marc Gravell): How about something like this, using HTML Agility Pack: HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(@
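Since the answer's code is truncated, here is an independent sketch of one way to walk tables with XPath: select each `table`, then its `tr` rows, then the `th` or `td` cells of each row. `TableReader` and `ReadRows` are hypothetical names, and the sketch does not claim to reproduce the answer. Nested tables would have their rows reported more than once by this simple version.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TableReader
{
    // Returns the cell texts of every row of every table, in document order.
    public static List<List<string>> ReadRows(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<List<string>> rows = new List<List<string>>();

        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null)
        {
            return rows; // SelectNodes returns null when there is no match
        }

        foreach (HtmlNode table in tables)
        {
            // ".//tr" also finds rows wrapped in thead/tbody/tfoot.
            HtmlNodeCollection trs = table.SelectNodes(".//tr");
            if (trs == null)
            {
                continue;
            }

            foreach (HtmlNode tr in trs)
            {
                List<string> cells = new List<string>();
                HtmlNodeCollection tds = tr.SelectNodes("th|td");
                if (tds != null)
                {
                    foreach (HtmlNode td in tds)
                    {
                        cells.Add(td.InnerText.Trim());
                    }
                }
                rows.Add(cells);
            }
        }
        return rows;
    }
}
```

The object order is therefore document, then table, then row, then cell, with a null check at each `SelectNodes` call.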

HtmlAgilityPack Drops Option End Tags

為{幸葍}努か submitted on 2019-11-26 11:05:52

Question: I am using HtmlAgilityPack. I create an HtmlDocument and call LoadHtml with the following string: <select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select> This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children: two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this: <select id="foo_Bar" name=

Selecting attribute values with html Agility Pack

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-26 08:30:06

Question: I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath: //div[@id='topslot']/a/img/@src As far as I can see, it finds the src attribute, but it returns the img tag. Why is that? I would expect InnerHtml or InnerText to be set, but both are empty strings. OuterHtml is set to the complete img tag. Is there any documentation for Html Agility Pack?

Answer 1: Html Agility Pack does not support attribute selection.

Answer 2: You can directly
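Given Answer 1's point that attribute selection is not supported, a common workaround is to drop the trailing `/@src` step, select the `img` element itself, and read the attribute from the node. The sketch below illustrates that idea and is not a reconstruction of the truncated Answer 2. `ImageSourceReader` and `GetSrc` are hypothetical names.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class ImageSourceReader
{
    // Returns the src of the first img matched by imgXPath, or null if there is none.
    public static string GetSrc(string html, string imgXPath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element, not the attribute: the XPath ends at "img".
        HtmlNode img = doc.DocumentNode.SelectSingleNode(imgXPath);
        if (img == null)
        {
            return null;
        }
        return img.GetAttributeValue("src", null);
    }
}
```

For the XPath in the question, the call would be `ImageSourceReader.GetSrc(html, "//div[@id='topslot']/a/img")`.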