Extracting Inner text from HTML BODY node with Html Agility Pack

Submitted by 北战南征 on 2019-12-01 15:18:10

Question


Need a bit of help with HTML Agility Pack!

Basically, I want to grab the plain text within the body node of the HTML. So far I have tried this in VB.NET, and it fails to return the inner text: as far as I can see, the output is unchanged.

Dim htmldoc As HtmlDocument = New HtmlDocument
htmldoc.LoadHtml(html)

Dim paragraph As HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//body")

If Not htmldoc Is Nothing Then
   For Each node In paragraph
       node.ParentNode.RemoveChild(node, True)
   Next
End If

Return htmldoc.DocumentNode.WriteContentTo

I have also tried this:

Return htmldoc.DocumentNode.InnerText

But still no luck.

Any advice?


Answer 1:


How about:

Return htmldoc.DocumentNode.SelectSingleNode("//body").InnerText



Answer 2:


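Jeff's solution is fine if the document has no tables. With tables, the text of adjacent cells runs together, like cell1cell2cell3. To prevent this, join the individual text nodes with a separator instead (C# example):

// Requires: using System.Linq; using HtmlAgilityPack;
// SelectNodes returns null when nothing matches, hence the null-conditional operators.
var words = doc.DocumentNode?.SelectNodes("//body//text()")?.Select(x => x.InnerText);
return words != null ? string.Join(" ", words) : string.Empty;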
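
For reference, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the class and method names (BodyText, Extract) are hypothetical and not part of Html Agility Pack, and the extra steps (decoding entities with HtmlEntity.DeEntitize and dropping whitespace-only text nodes) are additions that the answer above does not include.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class BodyText
{
    // Hypothetical helper (not part of Html Agility Pack): returns the text
    // inside <body>, with the individual text nodes separated by spaces.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        var words = textNodes
            // Assumption: decoding entities such as &amp; is desired here.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            // Skip text nodes that contain only whitespace (indentation, newlines).
            .Where(t => t.Length > 0);

        return string.Join(" ", words);
    }
}
```

Calling BodyText.Extract(html) on markup that contains a table should then give the cell texts separated by spaces rather than fused together.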


Source: https://stackoverflow.com/questions/6852165/extracting-inner-text-from-html-body-node-with-html-agility-pack
