How do I remove whitespace in HTML Source with Html Agility Pack and C#

Posted by 佐手、 on 2019-12-11 00:15:23

Question


Before posting I tried the solution from this thread:

C# - Remove spaces in HTML source in between markups?

Here is a snippet of the HTML I'm working with:

<p>This is my text</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>This is next text</p>

I'm using HTML Agility Pack to clean up the HTML:

HtmlDocument doc = new HtmlDocument();
doc.Load(htmlLocation);
foreach (var item in doc.DocumentNode.Descendants("p").ToList())
{
    if (item.InnerHtml == "&nbsp;")
    {
        item.Remove();
    }
}

The output of the code above is:

<p>This is my text</p>





<p>This is next text</p>

So my question is: how do I remove the extra whitespace between the two paragraphs in the HTML source?


Answer 1:


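The blank lines are the whitespace text nodes that sat between the removed <p> elements; removing the elements leaves those nodes behind. Remove the text nodes between the first and last paragraphs as well: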

HTML:

var html = @"
<p>This is my text</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>This is next text</p>";

Parse it:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var paragraphs = doc.DocumentNode.Descendants("p").ToList();
foreach (var item in paragraphs)
{
    if (item.InnerHtml == "&nbsp;") item.Remove();
}
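// Note: this selects every text node that is a following sibling of the
// first paragraph, so any non-whitespace text at that level is removed too.
// SelectNodes returns null when nothing matches, hence the null guard
// (List<HtmlNode> needs using System.Collections.Generic).
var followingText = paragraphs[0]
    .SelectNodes("following-sibling::text()")
    ?.ToList() ?? new List<HtmlNode>();
foreach (var text in followingText)
{
    text.Remove();
}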

Result:

<p>This is my text</p><p>This is next text</p>

If you want to keep the line break between the paragraphs, use a for loop and call Remove() on all except the last text node.

for (int i = 0; i < followingText.Count - 1; ++i)
{
    followingText[i].Remove();
}

Result:

<p>This is my text</p>
<p>This is next text</p>
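
If the document may contain other text at the same level, a more conservative variant is to remove each empty paragraph together with only the whitespace text node that directly follows it. The following is a minimal sketch under that assumption, not a definitive implementation; `HtmlCleaner` and `RemoveEmptyParagraphs` are hypothetical names introduced here for illustration and are not part of Html Agility Pack.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Hypothetical helper: removes <p> elements whose only content is &nbsp;
    // and the whitespace-only text node (the line break) right after each one.
    public static string RemoveEmptyParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() takes a snapshot so nodes can be removed while iterating.
        foreach (var p in doc.DocumentNode.Descendants("p").ToList())
        {
            if (p.InnerHtml.Trim() != "&nbsp;") continue;

            // The line break after the paragraph is a sibling text node.
            var next = p.NextSibling;
            if (next != null
                && next.NodeType == HtmlNodeType.Text
                && string.IsNullOrWhiteSpace(next.InnerText))
            {
                next.Remove();
            }

            p.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Called as `HtmlCleaner.RemoveEmptyParagraphs(html)`, this should keep the single line break that follows the first paragraph, and it leaves any text node that contains real content untouched.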


Source: https://stackoverflow.com/questions/43175880/how-do-i-remove-whitespace-in-html-source-with-html-agility-pack-and-c-sharp
