Parsing HTML page with HtmlAgilityPack using LINQ

僤鯓⒐⒋嵵緔 提交于 2019-12-12 10:16:40

问题


How can i parse html using Linq on a webpage and add values to a string. I am using the HtmlAgilityPack on a metro application and would like to bring back 3 values and add them to a string.

here is the url = http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N

I would like to get the values from the following see "belwo"

"Balance:", "Transactions in", "Received"

WebResponse x = await req.GetResponseAsync();
HttpWebResponse res = (HttpWebResponse)x;
if (res != null)
{
    if (res.StatusCode == HttpStatusCode.OK)
    {
        Stream stream = res.GetResponseStream();
        using (StreamReader reader = new StreamReader(stream))
        {
            html = reader.ReadToEnd();
        }
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        string appName = htmlDocument.DocumentNode.Descendants // not sure what t
        string a = "Name: " + WebUtility.HtmlDecode(appName);
    }
}

回答1:


Please try the following. You might also consider pulling the table apart as it is a little better formed than the free-text in the 'p' tag.

Cheers, Aaron.

// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

// extract the transactions 'h3' title, the node we want is directly before it
var transTitle = 
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();

// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens = 
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });

// using linqpad to debug, the dump command drops the currect variable to the output
tokens.Dump();

'Dump()', is a LinqPad command that dumps the variable to the console, the following is a sample of the output from the Dump command:

  • Balance: 5 LTC
  • Transactions in: 2
  • Received: 5 LTC
  • Transactions out: 0
  • Sent: 0 LTC



回答2:


the document you have to parse is not the most well formed for parsing many elements are missing the class or at least id attribute but what you want to get is a second p tag content in it

you can try this

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);



var pNodes = htmlDocument.DocumentNode.SelectNodes("//p")
[1].InnerHtml.ToString().Split(new string[] { "<br />" }, StringSplitOptions.None).Take(3);

 string vl="Balance:"+pNodes[0].Split(':')[1]+"Transactions in"+pNodes[1].Split(':')[1]+"Received"+pNodes[2].Split(':')[1];


来源:https://stackoverflow.com/questions/20852762/parsing-html-page-with-htmlagilitypack-using-linq

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!