Question
I'm just trying to learn about HtmlAgilityPack and XPath, and I'm attempting to get a list of company links from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Load the HTML into the HtmlAgilityPack document.
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I've used an XPath add-on for Chrome to get the following XPath:
//*table[@id='indu_table']/tbody/tr[*]/td/b/a
When running my project, I get an unhandled XPath exception about it being an invalid token.
I'm a little unsure what's wrong with it; I've tried putting a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour; is it anything simple?
Thanks
Answer 1:
Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML, so the Agility Pack doesn't help that much, but it does make things a bit easier. The following shows how it could be done using the Agility Pack together with Newtonsoft Json.NET to parse the JavaScript.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
    // Using Regex here to get just the array we're interested in...
    string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
    JArray jArray = JArray.Parse(stockArray);
    foreach (JToken token in jArray.Children())
    {
        listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
    }
}
To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [....
Each stock is one element in that array and is itself an array:
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element of each row, and appending the fixed URL prefix, we get the same result as the JavaScript on the page does.
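As a small illustration, here is a minimal sketch of parsing just the sample row shown above; the string literal and variable names are hypothetical, only for demonstration:
// Hypothetical illustration: parse a single row and build the link
string sampleRow = "[\"ATVI\", \"Activision Blizzard, Inc\", 11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]";
JArray row = JArray.Parse(sampleRow);
string symbol = row.First.Value<string>();                        // "ATVI"
string link = "http://www.nasdaq.com/symbol/" + symbol.ToLower(); // "http://www.nasdaq.com/symbol/atvi"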
Answer 2:
Why don't you just use the Descendants("a") method?
It's much simpler and more object-oriented: you just get a collection of node objects, and then you can read the "href" attribute from each of them.
Sample code:
htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", ""))
If you just need a list of links from a certain webpage, this method will do just fine.
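A slightly fuller sketch of that approach, assuming htmlDoc has already been loaded as in the question (variable names here are illustrative):
// Collect the href of every anchor present in the downloaded document
List<string> links = new List<string>();
foreach (HtmlNode a in htmlDoc.DocumentNode.Descendants("a"))
{
    // GetAttributeValue returns the fallback (null here) when the attribute is missing
    string href = a.GetAttributeValue("href", null);
    if (!string.IsNullOrEmpty(href))
        links.Add(href);
}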
Answer 3:
If you look at the page source for that URL, there isn't actually an element with id=indu_table. It appears to be generated dynamically (i.e. in JavaScript); the HTML that you get when loading directly from the server will not reflect anything that's changed by client-side script. This is probably why it's not working.
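As a minimal sketch built on the question's own code: even with a syntactically valid XPath, SelectNodes returns null when nothing in the downloaded HTML matches, so guarding before the foreach makes this failure mode visible:
// SelectNodes returns null when no node matches the XPath;
// here the table only exists after client-side script has run.
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr/td/b/a");
if (tl != null)
{
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        Debug.Write(node.InnerText);
    }
}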
Source: https://stackoverflow.com/questions/11017750/issue-with-htmlagilitypack-parsing-html-using-c-sharp