Parsing HTML page with HtmlAgilityPack using LINQ

问题

How can i parse html using Linq on a webpage and add values to a string. I am using the HtmlAgilityPack on a metro application and would like to bring back 3 values and add them to a string.

here is the url = http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N

I would like to get the values from the following see "belwo"

"Balance:", "Transactions in", "Received"

WebResponse x = await req.GetResponseAsync();
HttpWebResponse res = (HttpWebResponse)x;
if (res != null)
{
    if (res.StatusCode == HttpStatusCode.OK)
    {
        Stream stream = res.GetResponseStream();
        using (StreamReader reader = new StreamReader(stream))
        {
            html = reader.ReadToEnd();
        }
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        string appName = htmlDocument.DocumentNode.Descendants // not sure what t
        string a = "Name: " + WebUtility.HtmlDecode(appName);
    }
}

回答1:

Please try the following. You might also consider pulling the table apart as it is a little better formed than the free-text in the 'p' tag.

Cheers, Aaron.

// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

// extract the transactions 'h3' title, the node we want is directly before it
var transTitle = 
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();

// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens = 
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });

// using linqpad to debug, the dump command drops the currect variable to the output
tokens.Dump();

'Dump()', is a LinqPad command that dumps the variable to the console, the following is a sample of the output from the Dump command:

Balance: 5 LTC
Transactions in: 2
Received: 5 LTC
Transactions out: 0
Sent: 0 LTC

回答2:

the document you have to parse is not the most well formed for parsing many elements are missing the class or at least id attribute but what you want to get is a second p tag content in it

you can try this

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);



var pNodes = htmlDocument.DocumentNode.SelectNodes("//p")
[1].InnerHtml.ToString().Split(new string[] { "<br />" }, StringSplitOptions.None).Take(3);

 string vl="Balance:"+pNodes[0].Split(':')[1]+"Transactions in"+pNodes[1].Split(':')[1]+"Received"+pNodes[2].Split(':')[1];

来源：https://stackoverflow.com/questions/20852762/parsing-html-page-with-htmlagilitypack-using-linq

标签

linq

html-agility-pack