Using Html Agility Pack to parse nodes in a context sensitive fashion

Question

<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
   inner hmtl 1
</div>

<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>

I would like to parse the inner html between the tags in such a way that I can

* associate the inner html 1 with header 1 and date 1

* associate the inner html 2 with header 2 and date 2

In other words, at the time I parse inner html 1, I would like to know that the html nodes containing "Date 1" and "Header 1" have already been parsed (but the nodes containing "Date 2" and "Header 2" have not).

If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them together.
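
To make that text-based idea concrete, here is a rough sketch. It assumes each date and header sits on its own line exactly as in the sample above, and the `LineParserSketch` class and `PairLines` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LineParserSketch
{
    // Hypothetical helper: walks the lines once, remembers the most recent
    // date and header, and emits one (date, header, body) triple each time
    // a class-less <div> is closed.
    public static List<(string Date, string Header, string Body)> PairLines(IEnumerable<string> lines)
    {
        var result = new List<(string Date, string Header, string Body)>();
        string lastDate = "";
        string lastHeader = "";
        var body = new List<string>();
        bool inBody = false; // true while inside a class-less <div>

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            var m = Regex.Match(line, @"^<div class=""(mvb|mxb)""><b>(.*?)</b></div>$");
            if (m.Success)
            {
                // Assumption from the sample: "mvb" marks a date, "mxb" a header.
                if (m.Groups[1].Value == "mvb") lastDate = m.Groups[2].Value;
                else lastHeader = m.Groups[2].Value;
            }
            else if (line == "<div>")
            {
                inBody = true;
                body.Clear();
            }
            else if (line == "</div>" && inBody)
            {
                result.Add((lastDate, lastHeader, string.Join(" ", body)));
                inBody = false;
            }
            else if (inBody && line.Length > 0)
            {
                body.Add(line);
            }
        }
        return result;
    }
}
```

This only works while the markup keeps that exact one-tag-per-line layout, which is why I would rather do the same thing on parsed nodes.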

Answer 1:

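Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose XLinq crap :-). The XPath position() function is context sensitive: on the preceding-sibling axis it counts backwards from the current node, so position()=1 is the nearest preceding sibling. Here is some sample code: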

    HtmlDocument doc = new HtmlDocument();
    doc.Load("your html file");

    // select all DIV without a CLASS attribute defined
    foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
    {
        Console.WriteLine("div=" + div.InnerText.Trim());
        Console.WriteLine("  header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
        Console.WriteLine("  date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
    }

That will print this with your sample:

div=inner hmtl 1
  header=Header 1
  date=Date 1
div=inner html 2
  header=Header 2
  date=Date 2
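
If the real pages are less regular than the sample, a slightly more defensive variant may help. This is only a sketch: the `SiblingLookupSketch` class and `Extract` method are hypothetical names, and it assumes the HtmlAgilityPack package is referenced. It relies on SelectNodes and SelectSingleNode returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SiblingLookupSketch
{
    // Hypothetical helper: returns one (date, header, body) triple per
    // class-less <div>, skipping any <div> whose two preceding <div>
    // siblings do not both contain a <b> element.
    public static List<(string Date, string Header, string Body)> Extract(string html)
    {
        var result = new List<(string Date, string Header, string Body)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var divs = doc.DocumentNode.SelectNodes("//div[not(@class)]");
        if (divs == null) return result;

        foreach (HtmlNode div in divs)
        {
            // [1] and [2] are shorthand for [position()=1] and [position()=2].
            var header = div.SelectSingleNode("preceding-sibling::div[1]/b");
            var date = div.SelectSingleNode("preceding-sibling::div[2]/b");
            if (header == null || date == null) continue; // layout differs from the sample

            result.Add((date.InnerText.Trim(), header.InnerText.Trim(), div.InnerText.Trim()));
        }
        return result;
    }
}
```

Returning the triples instead of printing them makes it easier to check the association before doing anything else with the data.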

Answer 2:

Well, you can do this in several ways...

For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:

* Store all dates in an HtmlNodeCollection

* Store all headers in an HtmlNodeCollection

* Store all inner texts in another HtmlNodeCollection

If everything is okay and the HTML has that layout, you will have the same number of elements in all 3 collections.

Then you can easily do:

for (int i = 0; i < innerTexts.Count; i++) {
    //Get Date, Headers and Inner Texts at position i
}

The following should work:

var document = new HtmlWeb().Load("http://www.url.com"); //Or load it from a Stream, local file, etc.

var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");

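// PreviousSibling is usually a whitespace text node here (and null for the first node),
// so look up the nearest preceding element sibling with XPath instead.
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
                        let previous = node.SelectSingleNode("preceding-sibling::*[1]")
                        where previous != null && previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
                        select node).ToList();

//Check here that all 3 collections have the same number of elements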

for (int i = 0; i < dateNodes.Count; i++) {
    var date = dateNodes[i].InnerText;
    var header = headerNodes[i].InnerText;
    var innerText = innerTextNodes[i].InnerText;

    //Now you have the set you want: You have the Date, Header and Inner Text
}

This is one way of doing it. Of course, you should guard against failures (.SelectNodes(..) returns null when nothing matches), check for errors in the LINQ expression when building innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
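
Those checks might look something like the sketch below. The `NodeChecksSketch` class and both helper names are hypothetical, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeChecksSketch
{
    // Hypothetical helper: true only if every collection is non-null,
    // non-empty, and has the same number of elements as the first one.
    public static bool SameNonZeroCount(params IList<HtmlNode>[] collections)
    {
        if (collections == null || collections.Length == 0) return false;
        foreach (var c in collections)
        {
            if (c == null || c.Count == 0) return false;
            if (c.Count != collections[0].Count) return false;
        }
        return true;
    }

    // Hypothetical helper: the "receives an HtmlNode and returns its
    // InnerText" refactoring, made null-safe and trimmed.
    public static string TextOf(HtmlNode node)
    {
        return node == null ? string.Empty : node.InnerText.Trim();
    }
}
```

You would call `NodeChecksSketch.SameNonZeroCount(dateNodes, headerNodes, innerTextNodes)` right before the loop and bail out if it returns false, then use `NodeChecksSketch.TextOf(dateNodes[i])` and so on inside it.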

Keep in mind that the only way to know, in the HTML you posted, which <div> tag contains the inner text is to assume it is the one right after the <div> tag that contains the header. That's why I used the LINQ expression.

Another way of identifying it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags inside it and not just text. There is no magic when parsing HTML :)

Edit:
I have not tested this code. Test it yourself and let me know if it works.

Source: https://stackoverflow.com/questions/5609141/using-html-agility-pack-to-parse-nodes-in-a-context-sensitive-fashion

Tags

html-agility-pack