Parsing HTML using Html Agility Pack

Question

I have some HTML to parse (see below):

<div id="mailbox" class="div-w div-m-0">
    <h2 class="h-line">InBox</h2>
    <div id="mailbox-table">
        <table id="maillist">
            <tr>
                <th>From</th>
                <th>Subject</th>
                <th>Date</th>
            </tr>
            <tr onclick="location='readmail.html?mid=welcome'" style="font-weight: bold;">
                <td>no-reply@somemail.net</td>
                <td>
                    <a href="readmail.html?mid=welcome">Hi, Welcome</a>
                </td>
                <td>
                    <span title="2016-02-16 13:23:50 UTC">just now</span>
                </td>
            </tr>
            <tr onclick="location='readmail.html?mid=T0wM6P'" style="font-weight: bold;">
                <td>someone@outlook.com</td>
                <td>
                    <a href="readmail.html?mid=T0wM6P">sa</a>
                </td>
                <td>
                    <span title="2016-02-16 13:24:04">just now</span>
                </td>
            </tr>
        </table>
    </div>
</div>

I need to extract the links from the onclick attributes of the <tr> tags and the email addresses from the <td> tags.

So far I have managed to get the first occurrence of the email/link from my HTML.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

Could someone show me how to do this properly? Basically, I want to extract all the email addresses and links from the HTML that are in those tags.

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    Console.WriteLine(att.Value);
}

EDIT: I need to store the parsed values in a list of class instances, as pairs: the link to the email and the sender's email address.

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

Answer 1:

You can write the following code:

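HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

// List that collects one entry per mail row
List<ClassMailBox> classMailBoxes = new List<ClassMailBox>();

// First pass: store the onclick value of every <tr> that has one
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    ClassMailBox classMailbox = new ClassMailBox() { LinkToMail = att.Value };
    classMailBoxes.Add(classMailbox);
}

int currentPosition = 0;

// Second pass: fill in the sender from the first <td> of each of those rows
foreach (HtmlNode tableDef in doc.DocumentNode.SelectNodes("//tr[@onclick]/td[1]"))
{
    classMailBoxes[currentPosition].From = tableDef.InnerText;
    currentPosition++;
}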

To keep this code simple, I'm assuming some things:

The email address is always in the first td inside a tr that has an onclick attribute.
Every tr with an onclick attribute contains an email address.

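If those conditions don't hold, this code won't work correctly: a row without a td shifts every later pairing, so links get matched with the wrong email addresses. Also note that SelectNodes returns null when nothing matches, in which case the foreach loops throw a NullReferenceException.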
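
If you would rather not depend on those assumptions, one option is to read the link and the sender from the same row in a single pass. The following is only a sketch: the MailBoxParser class and the ExtractUrl helper are names I made up for illustration (they are not part of Html Agility Pack), and I'm assuming the onclick value always wraps the URL in single quotes, as in your sample.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

public static class MailBoxParser
{
    // Hypothetical helper: returns the text between the first and last single
    // quote of an onclick value such as "location='readmail.html?mid=welcome'",
    // or the raw value when no quoted part is found.
    public static string ExtractUrl(string onclick)
    {
        int first = onclick.IndexOf('\'');
        int last = onclick.LastIndexOf('\'');
        return (first >= 0 && last > first)
            ? onclick.Substring(first + 1, last - first - 1)
            : onclick;
    }

    public static List<ClassMailBox> Parse(string html)
    {
        var result = new List<ClassMailBox>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[@onclick]");
        if (rows == null)
        {
            return result;
        }

        foreach (HtmlNode row in rows)
        {
            // Relative XPath: only the first <td> child of this particular row.
            HtmlNode firstCell = row.SelectSingleNode("./td[1]");
            if (firstCell == null)
            {
                continue; // skip rows that have no <td> at all
            }

            // Both values come from the same row, so they cannot get out of step.
            result.Add(new ClassMailBox
            {
                From = HtmlEntity.DeEntitize(firstCell.InnerText).Trim(),
                LinkToMail = ExtractUrl(row.GetAttributeValue("onclick", ""))
            });
        }

        return result;
    }
}
```

You would then call it as List<ClassMailBox> mails = MailBoxParser.Parse(responseFromServer);. Because each ClassMailBox is built from one tr, a row that is missing its td is simply skipped instead of shifting the remaining pairs.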

Source: https://stackoverflow.com/questions/35434519/parsing-html-using-agility-pack

Tags

html

parsing

html-agility-pack