Using Html Agility Pack to parse nodes in a context sensitive fashion

Submitted by 戏子无情 on 2019-12-11 12:52:46

Question


<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
   inner hmtl 1
</div>

<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>

I would like to parse the inner html between the tags in such a way that I can:

    * associate the inner html 1 with header 1 and date 1
    * associate the inner html 2 with header 2 and date 2

In other words, at the time I parse the inner html 1, I would like to know that the html nodes containing "Date 1" and "Header 1" have already been parsed (but the nodes containing "Date 2" and "Header 2" have not).

If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them together.


Answer 1:


Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose XLinq crap :-). The XPath position() function is context-sensitive: it is evaluated relative to the current node and axis. Here is some sample code:

    // requires: using System; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.Load("your html file");

    // select all DIVs without a CLASS attribute defined
    // (SelectNodes returns null when nothing matches)
    foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
    {
        Console.WriteLine("div=" + div.InnerText.Trim());
        // preceding-sibling is a reverse axis: position()=1 is the nearest preceding DIV, position()=2 the one before it
        Console.WriteLine("  header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
        Console.WriteLine("  date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
    }

That will print this with your sample:

div=inner hmtl 1
  header=Header 1
  date=Date 1
div=inner html 2
  header=Header 2
  date=Date 2
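
For reference, here is a slightly more defensive sketch of the same idea. It is not part of the original answer: it assumes the markup is loaded from a string rather than a file, adds class checks to the sibling lookups, and guards against the null that SelectNodes and SelectSingleNode return when nothing matches. The "(none)" placeholder is made up for illustration, and the ?. operator needs C# 6 or later.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Markup modelled on the question, loaded from a string for illustration.
        const string html = @"
<div class=""mvb""><b>Date 1</b></div>
<div class=""mxb""><b>Header 1</b></div>
<div>inner html 1</div>
<div class=""mvb""><b>Date 2</b></div>
<div class=""mxb""><b>Header 2</b></div>
<div>inner html 2</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection contentDivs = doc.DocumentNode.SelectNodes("//div[not(@class)]");
        if (contentDivs == null)
        {
            Console.WriteLine("No content divs found.");
            return;
        }

        foreach (HtmlNode div in contentDivs)
        {
            // [1] and [2] count backwards from the current div because
            // preceding-sibling is a reverse axis; the class checks make sure
            // we really landed on a header div and a date div.
            string header = div.SelectSingleNode("preceding-sibling::div[1][@class='mxb']/b")?.InnerText ?? "(none)";
            string date = div.SelectSingleNode("preceding-sibling::div[2][@class='mvb']/b")?.InnerText ?? "(none)";

            Console.WriteLine(date + " | " + header + " | " + div.InnerText.Trim());
        }
    }
}
```

If a content div is not preceded by the expected header and date divs, this prints the placeholder instead of throwing a NullReferenceException.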



Answer 2:


Well, you can do this in several ways...

For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:

  1. Store all dates in an HtmlNodeCollection
  2. Store all headers in an HtmlNodeCollection
  3. Store all inner texts in another HtmlNodeCollection

If everything is okay and the HTML has that layout, you will have the same number of elements in all three collections.

Then you can easily do:

for (int i = 0; i < innerTexts.Count; i++) {
    //Get Date, Headers and Inner Texts at position i
}

The following should work:

var document = new HtmlWeb().Load("http://www.url.com"); //Or load it from a Stream, local file, etc.

var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");

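// PreviousSibling can be null or a whitespace text node, so ask XPath for the previous element sibling instead
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
                        let previous = node.SelectSingleNode("preceding-sibling::*[1]")
                        where previous != null && previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
                        select node).ToList();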

//Check here if the number of elements of the 3 collections are the same

for (int i = 0; i < dateNodes.Count; i++) {
    var date = dateNodes[i].InnerText;
    var header = headerNodes[i].InnerText;
    var innerText = innerTextNodes[i].InnerText;

    //Now you have the set you want: You have the Date, Header and Inner Text
}

This is one way of doing it. Of course, you should check for errors (for example, that the .SelectNodes(..) calls are not returning null), double-check the LINQ expression that builds innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
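
As a rough sketch of what those checks could look like (not the answerer's code): the EntryParser class and its Texts helper below are hypothetical names, the body divs are picked with a single XPath predicate instead of a LINQ query, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class EntryParser
{
    // Hypothetical helper: trimmed InnerText of every node matching xpath.
    // SelectNodes returns null when nothing matches, so map that to an empty list.
    static List<string> Texts(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static List<(string Date, string Header, string Body)> Parse(HtmlDocument doc)
    {
        List<string> dates = Texts(doc, "//div[@class='mvb']/b");
        List<string> headers = Texts(doc, "//div[@class='mxb']/b");
        // Body divs: the nearest preceding element sibling must be a div with class "mxb".
        List<string> bodies = Texts(doc, "//div[preceding-sibling::*[1][self::div and @class='mxb']]");

        // The index-based pairing only makes sense if the three lists line up.
        if (dates.Count != headers.Count || headers.Count != bodies.Count)
            throw new InvalidOperationException(
                "Unexpected layout: date, header and body counts differ.");

        var entries = new List<(string Date, string Header, string Body)>();
        for (int i = 0; i < dates.Count; i++)
            entries.Add((dates[i], headers[i], bodies[i]));
        return entries;
    }
}
```

A caller would then loop over EntryParser.Parse(document) and read entry.Date, entry.Header and entry.Body. Throwing on a count mismatch is just one option; skipping incomplete groups would be another.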

Keep in mind that, in the HTML you posted, the only way to know which <div> contains the inner text is to assume it is the one right after the <div> that contains the header. That's why I used the LINQ expression.

Another way of identifying it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags and not just text. There is no magic when parsing HTML :)

Edit:
I have not tested this code. Test it yourself and let me know if it works.
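
The same assumption can also be expressed from the header's side. The snippet below is only an illustration with a single made-up entry: it starts at each header <div> and takes the next element sibling with the following-sibling axis, which skips the whitespace text nodes between tags.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class=\"mvb\"><b>Date 1</b></div>\n" +
            "<div class=\"mxb\"><b>Header 1</b></div>\n" +
            "<div>inner html 1</div>\n");

        HtmlNodeCollection headerDivs = doc.DocumentNode.SelectNodes("//div[@class='mxb']");
        if (headerDivs == null) return; // no headers in this document

        foreach (HtmlNode headerDiv in headerDivs)
        {
            // following-sibling::*[1] is the next element sibling;
            // [self::div] rejects it if it is not a <div>.
            HtmlNode body = headerDiv.SelectSingleNode("following-sibling::*[1][self::div]");
            Console.WriteLine(headerDiv.InnerText.Trim() + " -> " + (body?.InnerText.Trim() ?? "(no body)"));
        }
    }
}
```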



Source: https://stackoverflow.com/questions/5609141/using-html-agility-pack-to-parse-nodes-in-a-context-sensitive-fashion
