HTML Agility pack removes break tag close

Submitted by 房东的猫 on 2019-12-03 23:10:43

This happens because the Html Agility Pack handles BR in a special way. It still supports the old HTML 3.2 syntax (still found on the web today), in which BR can be declared without a closing tag at all (browsers also still handle this gracefully, by the way).

To change this default behavior, you need to modify the static HtmlNode.ElementsFlags property, like this:

Dim doc As New HtmlDocument()
HtmlNode.ElementsFlags("br") = HtmlElementFlag.Empty
doc.LoadHtml("<test>before<br/>after</test>")
doc.OptionWriteEmptyNodes = True   
doc.Save(Console.Out)

which will display:

<test>before<br />after</test>

As per @Simon Mourier's answer, the following C# code works in version 1.4:

var doc = new HtmlDocument();
HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml("Lorem ipsum dolor sit<br/>Lorem ipsum dolor sit");

var postParsed = doc.DocumentNode.WriteTo();

postParsed then holds the following string value:

"Lorem ipsum dolor sit<br />Lorem ipsum dolor sit"

This seems to be a standard setting in the Html Agility Pack. By default, its output does not conform to XHTML, and many tags are not closed.

There are two ways to change this. At the document level, you can do the following, which turns on closing for ALL tags. (This is my preferred method.)

HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(content);

However, this may not be desirable. The other way is to do it at the node level, by setting the flag for a specific element name:

if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}

I encountered the same kind of problem and solved it by manually re-parsing the HTML chunk with a new HtmlDocument object that has the correct settings.

The problem, as I see it, is that HtmlDocument has all those nice settings to let you close tags and so on, but when you select a node or do some other sort of operation with nodes and then use their OuterHtml or InnerHtml, some of those closing tags are lost (probably because those properties do not use the same settings as the document itself, or maybe there is some other reason). So when you get that incorrect HTML string from InnerHtml or OuterHtml, you can just re-parse it with a new HtmlDocument and use document.DocumentNode.InnerHtml to get the correct HTML string.
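
The re-parsing idea could be sketched roughly as follows. This is only an illustration under some assumptions: HtmlReparser and Normalize are hypothetical names (not part of the Html Agility Pack), the br flag is set the same way as in the earlier answers, and the result is read with DocumentNode.WriteTo() (as in the snippet above) rather than InnerHtml.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class HtmlReparser
{
    // Hypothetical helper (not part of Html Agility Pack): re-parses an HTML
    // fragment with a fresh HtmlDocument and serializes it again.
    public static string Normalize(string fragment)
    {
        // Static setting on HtmlNode: treat br as an empty element.
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true; // ask the writer to close empty nodes
        doc.LoadHtml(fragment);

        // Serialize the parsed tree with the options set above.
        return doc.DocumentNode.WriteTo();
    }
}

static class Program
{
    static void Main()
    {
        // In practice the input would be a string taken from InnerHtml or OuterHtml.
        Console.WriteLine(HtmlReparser.Normalize("before<br/>after"));
    }
}
```

Because HtmlNode.ElementsFlags is static, changing it affects every document parsed afterwards in the same process, so it may be preferable to set it once at startup instead of inside the helper.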
