html-agility-pack

HtmlAgilityPack - Grab data from html table

穿精又带淫゛_ submitted on 2019-12-02 02:11:09
My program uses HtmlAgilityPack to fetch an HTML web page and store it in a variable. I'm trying to extract two tables that sit under div tags with a specific class (boardcontainer). My current code searches the whole page for every table and displays them, but when a cell is empty it throws an exception: "NullReferenceException was unhandled - Object reference not set to an instance of an object." A snippet of the HTML (in this case I'm searching for 'Microsoft' on the website): <div class="boardcontainer"> <table cellpadding="4" cellspacing="1" border="0" width="100%">
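A minimal sketch of one way to restrict the search to tables inside the boardcontainer div and to avoid the NullReferenceException. It assumes the exception comes from SelectNodes returning null when nothing matches, which the truncated excerpt does not confirm. The class name BoardTables and the method ReadRows are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not taken from the original question.
public static class BoardTables
{
    // Returns one list of cell texts per row, for every table found
    // under <div class="boardcontainer">.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so every result is checked before it is enumerated.
        var tables = doc.DocumentNode.SelectNodes("//div[@class='boardcontainer']//table");
        if (tables == null) return rows;

        foreach (var table in tables)
        {
            // Note: with nested tables, inner rows would be visited more than once.
            var trs = table.SelectNodes(".//tr");
            if (trs == null) continue;

            foreach (var tr in trs)
            {
                var cells = tr.SelectNodes("./th|./td");
                if (cells == null) continue;

                // An empty cell yields an empty string rather than null.
                rows.Add(cells
                    .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
                    .ToList());
            }
        }

        return rows;
    }
}
```

Calling BoardTables.ReadRows(pageHtml) on the downloaded page text yields only the rows of tables inside the boardcontainer div; tables elsewhere on the page are ignored.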

Html Agility Pack - <option> inner text

北城余情 submitted on 2019-12-01 23:32:07
I have a problem with this HTML: <select id="attribute1021" class="required-entry super-attribute-select" name="super_attribute[1021]"> <option value="">Choose an Option...</option> <option value="281">001 Melaike</option> <option value="280">002 Taronja</option> <option value="289">003 Lill</option> <option value="288">004 Chèn</option> <option value="287">005 Addition</option> <option value="286">006 Iskia</option> <option value="285">007 Milele</option> <option value="284">008 Cali</option> <option value="283">009 Odessa</option> <option value="282">010 Manaus</option> <option value="303">011

Does HtmlAgilityPack have the ability to use regular expressions in its XPATH selector?

拜拜、爱过 submitted on 2019-12-01 23:05:06
Question: I would like to be able to create a collection of nodes where the text starts with a word and then a number. For example, given the following: <p>FINDTHIS 1</p> <p>FINDTHIS SOMETEXT</p> <p>FINDTHIS 2</p> I would like to be able to create a collection consisting of two paragraph nodes: FINDTHIS 1 and FINDTHIS 2. One possible approach would be to create an xpath query like //p[starts-with(., 'FINDTHIS ')] and then use a regular expression to determine whether or not the next character is a

Does HtmlAgilityPack have the ability to use regular expressions in its XPATH selector?

ⅰ亾dé卋堺 submitted on 2019-12-01 22:02:13
I would like to be able to create a collection of nodes where the text starts with a word and then a number. For example, given the following: <p>FINDTHIS 1</p> <p>FINDTHIS SOMETEXT</p> <p>FINDTHIS 2</p> I would like to be able to create a collection consisting of two paragraph nodes: FINDTHIS 1 and FINDTHIS 2. One possible approach would be to create an xpath query like //p[starts-with(., 'FINDTHIS ')] and then use a regular expression to determine whether or not the next character is a number. If I wanted to obtain a list of matches that returned the above criteria, I could create a regular
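The two-step approach described in the question can be sketched as follows: XPath 1.0, which HtmlAgilityPack's SelectNodes is built on, has no regular-expression function, so the XPath narrows the candidates and a .NET Regex does the final check. The class name FindThis, the method Select, and the exact pattern are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not taken from the original question.
public static class FindThis
{
    // "FINDTHIS ", then at least one digit, anchored at the start of the text.
    private static readonly Regex WordThenNumber = new Regex(@"^FINDTHIS \d+");

    public static List<HtmlNode> Select(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Step 1: XPath keeps only paragraphs whose text starts with the word.
        var candidates = doc.DocumentNode.SelectNodes("//p[starts-with(., 'FINDTHIS ')]");
        if (candidates == null) return new List<HtmlNode>();

        // Step 2: the regular expression checks that a number follows.
        return candidates
            .Where(p => WordThenNumber.IsMatch(p.InnerText))
            .ToList();
    }
}
```

For the three paragraphs in the question, FindThis.Select returns the FINDTHIS 1 and FINDTHIS 2 nodes and skips FINDTHIS SOMETEXT.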

Issue with HTMLAgilityPack parsing HTML using C#

痞子三分冷 submitted on 2019-12-01 21:46:36
Question: I'm just trying to learn about HTMLAgilityPack and XPath. I'm attempting to get a list of companies (as HTML links) from the NASDAQ website: http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx I currently have the following code: HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument(); // Create a request for the URL. WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"); // Get the response. HttpWebResponse response = (HttpWebResponse

HTMLAgilityPack - Remove Node without stripping the inner text

痴心易碎 submitted on 2019-12-01 20:01:30
My HTML content is: <a href="#asdf">asdf</a> <H5 align="left"><A href="#d570525d497.htm#toc">Table of Contents</A><br></H5> I'm using HTML Agility Pack to load the HTML. I want to find the <a> nodes and remove each node without removing its inner text, as shown here: asdf <H5 align="left">Table of Contents<br></H5> I'm using the code below: var htmlDocument = new HtmlDocument(); htmlDocument.LoadHtml(htmlPage); var Nodes = htmlDocument.DocumentNode.SelectNodes("//a"); foreach (var Node in Nodes) { Node.InnerText.Trim(); } It's not working. Is something wrong with the code? Remove the node from parent
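The answer excerpt is cut off after "Remove the node from parent", so the following is only a sketch of one way to do that, not necessarily the answerer's code. It uses HtmlNode.RemoveChild with keepGrandChildren set to true, which removes the <a> element while leaving its children in the parent. The names AnchorStripper and UnwrapAnchors are hypothetical.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not taken from the original answer.
public static class AnchorStripper
{
    // Removes every <a> element but keeps whatever was inside it.
    public static string UnwrapAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no <a> elements.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors == null) return html;

        // Iterate over a copy because the tree is modified inside the loop.
        foreach (var anchor in anchors.ToList())
        {
            // keepGrandChildren: true moves the anchor's children up to its parent.
            anchor.ParentNode.RemoveChild(anchor, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The original loop only calls Node.InnerText.Trim(), which computes a string and discards it, so the document is never changed. Note that the regenerated markup may differ in tag-name casing from the input.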

From the Html Agility Pack download, which one of the 9 “HtmlAgilityPack.dll” do I use?

百般思念 submitted on 2019-12-01 19:14:46
There are nine folders in the downloaded zip file for HTML Agility Pack: Net20 Net40 Net40-client Net45 sl3-wp sl4 sl4-windowsphone71 sl5 winrt45 I do not know what these folder names mean. Please explain which one I need in order to scrape data from html files using VS2010. Please explain where I should put the files. The different versions are compiled against different .NET framework versions. Some frameworks, such as the WinRT or the Silverlight frameworks, have more limited functionality or require slightly different (and often slower) approaches to implement the features of the component

simulate infinite scrolling in c# to get full html of a page

我怕爱的太早我们不能终老 submitted on 2019-12-01 18:21:43
Question: There are lots of sites that use this (imo) annoying "infinite scrolling" style. Examples of this are sites like tumblr, twitter, 9gag, etc. I recently tried to scrape some pics off of these sites programmatically with HtmlAgilityPack, like this: HtmlWeb web = new HtmlWeb(); HtmlDocument doc = web.Load(url); var primary = doc.DocumentNode.SelectNodes("//img[@class='badge-item-img']"); var picstring = primary.Select(r => r.GetAttributeValue("src", null)).FirstOrDefault(); This works fine, but

HtmlAgilityPack and Authentication

谁说胖子不能爱 submitted on 2019-12-01 18:10:37
I have a method to get ids and xpaths if given a particular url. How do I pass in the username and password with the request so that I can scrape a url that requires a username and password? using HtmlAgilityPack; _web = new HtmlWeb(); internal Dictionary<string, string> GetidsAndXPaths(string url) { var webidsAndXPaths = new Dictionary<string, string>(); var doc = _web.Load(url); var nodes = doc.DocumentNode.SelectNodes("//*[@id]"); if (nodes == null) return webidsAndXPaths; // code to get all the xpaths and ids Should I use a web request to get the page source and then pass that file into
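A minimal sketch of the approach the question ends on: download the page source with credentials first, then hand the text to HtmlAgilityPack. It assumes the site uses HTTP-level authentication (for example Basic or NTLM); a form-based login would need a different flow. The names AuthenticatedScraper and DownloadAsync are hypothetical, and because the body of the original method is elided, mapping each id to node.XPath is an assumption as well.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper, not taken from the original question.
public static class AuthenticatedScraper
{
    // Fetches the page source, sending the credentials with the request.
    // Assumption: the server accepts HTTP-level authentication.
    public static async Task<string> DownloadAsync(string url, string userName, string password)
    {
        using (var handler = new HttpClientHandler
        {
            Credentials = new NetworkCredential(userName, password)
        })
        using (var client = new HttpClient(handler))
        {
            return await client.GetStringAsync(url);
        }
    }

    // Parses already-downloaded HTML and maps each id to the node's XPath.
    public static Dictionary<string, string> GetIdsAndXPaths(string html)
    {
        var idsAndXPaths = new Dictionary<string, string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
        if (nodes == null) return idsAndXPaths;

        foreach (var node in nodes)
        {
            // Assumption: duplicate ids simply overwrite earlier entries.
            idsAndXPaths[node.GetAttributeValue("id", "")] = node.XPath;
        }

        return idsAndXPaths;
    }
}
```

Separating the download from the parsing keeps the HtmlAgilityPack part independent of how the page was obtained, so the same GetIdsAndXPaths call works on HTML read from a file or from an authenticated request.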

HtmlAgilityPack.HtmlDocument Cookies

。_饼干妹妹 submitted on 2019-12-01 17:58:24
This pertains to cookies set inside a script (maybe inside a script tag). System.Windows.Forms.HtmlDocument executes those scripts and the cookies set (like document.cookie=etc... ) can be retrieved through its Cookies property. I assume HtmlAgilityPack.HtmlDocument doesn't do this (execution). I wonder if there is an easy way to emulate the System.Windows.Forms.HtmlDocument capabilities (the cookies part). Anyone? When I need to use Cookies and HtmlAgilityPack together, or just create custom requests (for example, set the User-Agent property, etc), here is what I do: Create a class that