Parsing HTML using Html Agility Pack

Submitted by 混江龙づ霸主 on 2019-12-10 17:48:24

Question


I have some HTML to parse (see below):

<div id="mailbox" class="div-w div-m-0">
    <h2 class="h-line">InBox</h2>
    <div id="mailbox-table">
        <table id="maillist">
            <tr>
                <th>From</th>
                <th>Subject</th>
                <th>Date</th>
            </tr>
            <tr onclick="location='readmail.html?mid=welcome'" style="font-weight: bold;">
                <td>no-reply@somemail.net</td>
                <td>
                    <a href="readmail.html?mid=welcome">Hi, Welcome</a>
                </td>
                <td>
                    <span title="2016-02-16 13:23:50 UTC">just now</span>
                </td>
            </tr>
            <tr onclick="location='readmail.html?mid=T0wM6P'" style="font-weight: bold;">
                <td>someone@outlook.com</td>
                <td>
                    <a href="readmail.html?mid=T0wM6P">sa</a>
                </td>
                <td>
                    <span title="2016-02-16 13:24:04">just now</span>
                </td>
            </tr>
        </table>
    </div>
</div>

I need to extract the links in the onclick attributes of the <tr> tags and the email addresses in the <td> tags.

So far I have managed to get only the first occurrence of the email/link from my HTML.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

Could someone show me how this is done properly? Basically, I want to collect all the email addresses and links from the HTML that are in those tags.

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    Console.WriteLine(att.Value);
}

EDIT: I need to store the parsed values in pairs in a list of class instances: the link to the email and the sender's email address.

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

Answer 1:


You can write the following code:

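// Requires: using System.Collections.Generic; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

// The list that collects one entry per clickable row.
List<ClassMailBox> classMailBoxes = new List<ClassMailBox>();

// First pass: store the onclick value of every row that has one.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    ClassMailBox classMailbox = new ClassMailBox() { LinkToMail = att.Value };
    classMailBoxes.Add(classMailbox);
}

int currentPosition = 0;

// Second pass: fill in the sender from the first cell of each of those rows.
foreach (HtmlNode tableDef in doc.DocumentNode.SelectNodes("//tr[@onclick]/td[1]"))
{
    classMailBoxes[currentPosition].From = tableDef.InnerText;
    currentPosition++;
}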

To keep this code simple, I'm assuming a few things:

  1. The email address is always in the first td inside a tr that has an onclick attribute
  2. Every tr with an onclick attribute contains an email address

If those conditions don't hold, this code won't work: it could throw exceptions (for example a NullReferenceException, because SelectNodes returns null when nothing matches), or it could pair links with the wrong email addresses.
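
If those assumptions are a concern, one possible variation is to read both values from the same row in a single pass, so a link can never be paired with the sender of a different row. The following is only a sketch under some assumptions: the MailBoxParser class and its Parse method are hypothetical names introduced for illustration, rows without a first cell are simply skipped, and the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

// Hypothetical helper, not part of the original answer.
public static class MailBoxParser
{
    public static List<ClassMailBox> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<ClassMailBox>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[@onclick]");
        if (rows == null)
        {
            return result;
        }

        foreach (HtmlNode row in rows)
        {
            // Relative XPath: search only inside the current row.
            HtmlNode fromCell = row.SelectSingleNode("./td[1]");
            if (fromCell == null)
            {
                continue; // Assumption: rows without a first cell are skipped.
            }

            result.Add(new ClassMailBox
            {
                From = fromCell.InnerText.Trim(),
                LinkToMail = row.GetAttributeValue("onclick", string.Empty)
            });
        }

        return result;
    }
}
```

Calling MailBoxParser.Parse(responseFromServer) would then return one ClassMailBox per clickable row. Note that LinkToMail still holds the whole onclick value (for example location='readmail.html?mid=welcome'), as in the code above; if only the URL is wanted, reading the href attribute of the a element in the second cell may be a simpler option.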



Source: https://stackoverflow.com/questions/35434519/parsing-html-using-agility-pack
