InnerText=InnerHtml - How to extract readable text with HtmlAgilityPack

蓝咒 submitted on 2019-12-13 19:50:30

Question


I need to extract text from some very messy HTML.

I'm trying to do this using VB.NET and HtmlAgilityPack.

The tag that I need to parse has InnerText equal to InnerHtml, and both contain:

Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -

While debugging I can read it using the "Html viewer", which shows:

Name: Albert Einstein section: 3 room: -

How can I get this into a string variable?

EDIT:

I use this code to get the node:

Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next

Answer 1:


If you look closely, the mess is just HTML comments, and those should be ignored. Selecting only the text nodes and concatenating them with string.Join is enough:

C#

var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));

VB.net

 Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                                   Select t.InnerText)

The HTML is valid and there is nothing wrong with it; it was just written by someone without a soul.

Based on your update, this should do:
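
For reference, here is a minimal, self-contained sketch of the same idea applied to the sample line from the question. The wrapping <p> tag, the module name, the entity decoding with HtmlEntity.DeEntitize and the whitespace collapsing with a regular expression are my own assumptions, added to get closer to what the "Html viewer" displays. They are not part of the approach above, so treat them as optional.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module NormalizedTextDemo
    Sub Main()
        ' Sample markup from the question, wrapped in a <p> for this demo (assumption).
        Dim html = "<p>Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -</p>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' text() matches text nodes only, so comment nodes are left out.
        ' SelectNodes returns Nothing when nothing matches, hence the guard.
        Dim textNodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]")
        Dim raw = If(textNodes Is Nothing, "",
                     String.Join("", textNodes.Select(Function(t) t.InnerText)))

        ' Optional extras (assumptions): decode entities such as &amp;
        ' and collapse runs of whitespace into a single space.
        Dim decoded = HtmlEntity.DeEntitize(raw)
        Dim normalized = Regex.Replace(decoded, "\s+", " ").Trim()

        Console.WriteLine(normalized)
    End Sub
End Module
```

Keep in mind that the [normalize-space()] filter drops text nodes containing only whitespace. If two words are separated by nothing but such a node, they end up glued together, so you may prefer plain text() plus the whitespace collapsing step in that case.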

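Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
' SelectNodes returns Nothing when no node matches, so guard before looping.
If ElePs IsNot Nothing Then
    For Each EleP As HtmlNode In ElePs
        ' Text nodes under this <p> only; comment nodes are not matched by text().
        Dim textNodes = EleP.SelectNodes(".//text()[normalize-space()]")
        If textNodes Is Nothing Then Continue For
        Dim text = String.Join("", From t In textNodes Select t.InnerText).Trim()
    Next
End If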

Note the leading dot in .//: it restricts the search to the descendants of the current node, unlike //, which always starts from the top of the document no matter which node you call it on.
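
A small sketch to see the difference for yourself. The two-paragraph markup and the module name are made up for the demonstration; only the div id comes from the question.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathScopeDemo
    Sub Main()
        ' Hypothetical markup: two paragraphs inside the div from the question.
        Dim doc As New HtmlDocument()
        doc.LoadHtml("<div id='div_main'><p>first</p><p>second</p></div>")

        ' SelectSingleNode returns the first match, here the first <p>.
        Dim firstP = doc.DocumentNode.SelectSingleNode("//div[@id='div_main']//p")

        ' Relative path: only text nodes below firstP.
        Dim relative = firstP.SelectNodes(".//text()[normalize-space()]")

        ' Absolute path: evaluated from the document root,
        ' even though it is called on firstP.
        Dim absolute = firstP.SelectNodes("//text()[normalize-space()]")

        Console.WriteLine("relative: {0}", relative.Count)
        Console.WriteLine("absolute: {0}", absolute.Count)
    End Sub
End Module
```

The relative query should only see the text of the first paragraph, while the absolute one also picks up the text of the second paragraph. That is why using // inside the loop would give every <p> the text of the whole page.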



Source: https://stackoverflow.com/questions/35744250/innertext-innerhtml-how-to-extract-readable-text-with-htmlagilitypack