HTML Agility Pack - using XPath to get a single node - Object Reference not set to an instance of an object

僤鯓⒐⒋嵵緔 posted on 2019-12-03 07:25:28
Simon Mourier

You can't rely on developer tools such as Firebug or Chrome's to determine the XPath for the nodes you're after: the XPath those tools report corresponds to the in-memory HTML DOM, while the Html Agility Pack only knows about the raw HTML sent back by the server.

What you need to do is look at what the server actually sends back (for example, with View Source). You'll see, for instance, that there is no TBODY element. So find something discriminating in the raw markup to anchor on, and use XPath axes to navigate from it. Also, your XPath, even if it worked, would not be very resistant to changes in the document, so look for something more "stable" to make the scraping more future-proof.
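
To illustrate the mismatch, here is a minimal sketch. The HTML string and the `rates` id are made up for the example, and it assumes the HtmlAgilityPack NuGet package is referenced. A path that includes `tbody`, as a browser tool would typically report it, finds nothing in the raw markup, and `SelectSingleNode` returns null in that case, which is where the "Object reference not set" error comes from if the result is used without a check.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical raw HTML, as a server might send it: note there is no <tbody>.
string rawHtml = "<table id='rates'><tr><td>1.23</td></tr></table>";

var doc = new HtmlDocument();
doc.LoadHtml(rawHtml);

// Path as a browser tool might report it (the browser DOM adds <tbody>).
var fromBrowserPath = doc.DocumentNode.SelectSingleNode("//table[@id='rates']/tbody/tr/td");

// Path written against the raw markup.
var fromRawPath = doc.DocumentNode.SelectSingleNode("//table[@id='rates']/tr/td");

// Always check for null before touching the node.
Console.WriteLine(fromBrowserPath == null ? "browser-style path: no match" : fromBrowserPath.InnerText);
Console.WriteLine(fromRawPath == null ? "raw-markup path: no match" : fromRawPath.InnerText);
```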

Here is some code that seems to work:

HtmlNode node = doc.DocumentNode.SelectSingleNode("//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

This is what it does:

  • find a TD element whose CLASS attribute is set to 'dnTableCell'. The // token means the search is recursive through the document hierarchy.
  • beneath it, find an A element whose text (inner text) equals 'High'.
  • navigate two parents up (which gets us to the closest TR element).
  • select the 3rd TD element from there.
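
The steps above can be tried against a small table. This is only a sketch: the markup and the value `42.10` are invented for the example, not taken from the page in the question, and it assumes the HtmlAgilityPack NuGet package is referenced. It runs the full expression, then repeats the last two steps by hand to show where the `../..` lands.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup shaped like the row the XPath targets.
string html = @"
<table>
  <tr>
    <td class='dnTableCell'><a href='#'>High</a></td>
    <td class='dnTableCell'>ignored</td>
    <td class='dnTableCell'>42.10</td>
  </tr>
</table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// The full expression from the answer.
var node = doc.DocumentNode.SelectSingleNode(
    "//td[@class='dnTableCell']//a[text()='High']/../../td[3]");

// The same walk, one step at a time.
var link = doc.DocumentNode.SelectSingleNode("//td[@class='dnTableCell']//a[text()='High']");
var row = link?.ParentNode?.ParentNode;          // a -> td -> tr
var thirdCell = row?.SelectSingleNode("./td[3]"); // 3rd TD of that row

Console.WriteLine(node == null ? "no match" : node.InnerText);
Console.WriteLine(row == null ? "no row" : row.Name);
```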

As Simon Mourier explained, what you obtained is the raw HTML sent by the server. The element you need has not been rendered yet, so you can't retrieve it: it does not exist in the DOM yet. A simple workaround is to use a web renderer to build the DOM; then you can grab the resulting HTML and scrape it. I use WatiN like this:

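// Requires: using WatiN.Core;
WatiN.Core.Settings.MakeNewInstanceVisible = false;  // keep the browser window hidden
WatiN.Core.Settings.AutoMoveMousePointerToTopLeft = false;
IE ie = new IE();
ie.GoTo(urlLink);       // urlLink: the URL of the page to scrape
ie.WaitForComplete();   // wait until the page has finished loading
string html = ie.Html;  // HTML as built by the browser
ie.Close();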
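
Once you have the rendered HTML as a string, it can be handed to the Html Agility Pack like any other markup. Below is a minimal sketch of that hand-off. `ExtractText` is a hypothetical helper, and `renderedHtml` is a made-up stand-in for the string read from the browser; it assumes the HtmlAgilityPack NuGet package is referenced. Because the rendered markup contains `tbody`, a browser-style path can match here.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: returns the trimmed inner text of the first node
// matching xpath, or null when nothing matches.
static string ExtractText(string html, string xpath)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var node = doc.DocumentNode.SelectSingleNode(xpath);
    return node == null ? null : node.InnerText.Trim();
}

// Made-up stand-in for the HTML string obtained from the browser.
string renderedHtml =
    "<table><tbody><tr><td>High</td><td> 42.10 </td></tr></tbody></table>";

string value = ExtractText(renderedHtml, "//table/tbody/tr/td[2]");
string missing = ExtractText(renderedHtml, "//table/tbody/tr/td[5]");

Console.WriteLine(value ?? "no match");
Console.WriteLine(missing ?? "no match");
```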