Issue with HTMLAgilityPack parsing HTML using C#

Question

I'm learning HtmlAgilityPack and XPath, and I'm trying to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

I currently have the following code:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Load the HTML into the HtmlAgilityPack document.
htmlDoc.LoadHtml(responseFromServer);

HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}

// Clean up the streams and the response.
reader.Close();
dataStream.Close();
response.Close();

I used an XPath add-on for Chrome to get the following XPath:

//*table[@id='indu_table']/tbody/tr[*]/td/b/a

When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.

I'm not sure what's wrong with it. I've tried putting a number in the tr[*] section above, but I still get the same error.

I've been looking at this for the last hour. Is it something simple?

Thanks

Answer 1:

Since the data is produced by JavaScript, you have to parse the JavaScript rather than the HTML. The Agility Pack doesn't help much with that, but it does make it easier to locate the script. The following shows how it could be done using the Agility Pack to find the script and Newtonsoft Json.NET to parse the JavaScript array.

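// Requires: using System.Collections.Generic; using System.IO; using System.Net;
//           using System.Text.RegularExpressions; using HtmlAgilityPack; using Newtonsoft.Json.Linq;
HtmlDocument htmlDoc = new HtmlDocument();
// Dispose the client and the response stream once the document is loaded.
using (WebClient client = new WebClient())
using (Stream stream = client.OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"))
{
  htmlDoc.Load(stream);
}
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use a regex to capture just the array literal assigned to table_body.
  string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
  JArray jArray = JArray.Parse(stockArray);
  foreach (JToken token in jArray.Children())
  {
    // The first element of each row is the ticker symbol.
    listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
  }
}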

To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [.... Each stock is one element of that array and is itself an array:

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

So by parsing the array, taking the first element of each row (the ticker symbol) and appending it to the fixed base URL, we get the same links that the page's JavaScript generates.
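The extraction step can also be tried offline against a string, without fetching the page. The following is a minimal sketch, not the answer's exact code: it swaps Json.NET for the built-in System.Text.Json, and the class and method names (TableBodyParser, ExtractSymbolUrls) are hypothetical. It assumes the array literal is valid JSON (double-quoted strings, as in the sample row above) and that no string value contains "];".

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class TableBodyParser
{
    // Pulls the array literal assigned to table_body out of a script and
    // returns one symbol URL per row. Returns an empty list if nothing matches.
    public static List<string> ExtractSymbolUrls(string scriptText)
    {
        var urls = new List<string>();

        // Singleline lets '.' cross line breaks in case the array is wrapped.
        Match match = Regex.Match(
            scriptText,
            @"table_body\s*=\s*(?<Array>\[.+?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
        {
            return urls;
        }

        using JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value);
        foreach (JsonElement row in doc.RootElement.EnumerateArray())
        {
            // The first element of each row is the ticker symbol.
            string symbol = row[0].GetString();
            urls.Add("http://www.nasdaq.com/symbol/" + symbol.ToLowerInvariant());
        }
        return urls;
    }
}
```

Passing in the InnerText of the script node found by the Agility Pack would give the same list as the loop above, under the stated assumptions.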

Answer 2:

Why not just use the Descendants("a") method? It's much simpler and more object-oriented. You get back a sequence of node objects, and then you can read the "href" attribute from each of them.

Sample code:

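// Requires using System.Linq; yields the href of every <a> element that has one.
htmlDoc.DocumentNode.Descendants("a").Where(a => a.Attributes["href"] != null).Select(a => a.Attributes["href"].Value)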

If you just need a list of links from a given web page, this method will do just fine.
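As a self-contained illustration of this approach, here is a small sketch that runs against an HTML string instead of a live page. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (LinkLister, GetHrefs) are hypothetical. Note that it can only see links present in the markup it is given, not links added later by client-side script.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href of every <a> element in the markup that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")
            // Anchors without an href yield the empty default and are dropped.
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToList();
    }
}
```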

Answer 3:

If you look at the page source for that URL, there isn't actually an element with id=indu_table. It appears to be generated dynamically (i.e. by JavaScript); the HTML that you get when loading directly from the server does not reflect anything changed by client-side script. This is probably why it's not working.
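Two separate failure modes seem to be in play here, and both can be checked in isolation. First, "invalid token" is what .NET's XPath parser reports for a malformed expression, and the //*table[...] form quoted in the question is malformed: * is already a complete name test, so it cannot be followed by table (either //table[...] or //*[...] is well-formed). Second, when an expression is well-formed but matches nothing, as happens when the element is missing from the served HTML, SelectNodes returns null by default rather than an empty collection, so iterating the result directly fails. The sketch below assumes the HtmlAgilityPack NuGet package; the class and method names (SelectionChecks, IsValidXPath, SelectOrEmpty) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.XPath;
using HtmlAgilityPack;

public static class SelectionChecks
{
    // True when the expression compiles as XPath 1.0 (a syntax check only;
    // it says nothing about whether any node would match).
    public static bool IsValidXPath(string expression)
    {
        try
        {
            XPathExpression.Compile(expression);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    // Runs SelectNodes and converts a null result (no match) into an
    // empty list so the caller can iterate it safely.
    public static List<HtmlNode> SelectOrEmpty(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }
}
```

With a guard like SelectOrEmpty, the foreach loop in the question would simply print nothing when the table is absent, which makes it easier to tell a syntax problem apart from a missing element.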

Source: https://stackoverflow.com/questions/11017750/issue-with-htmlagilitypack-parsing-html-using-c-sharp

Tags

xpath

html-agility-pack