html-agility-pack

Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

坚强是说给别人听的谎言 submitted on 2019-12-10 10:57:50
Question: I am working on a Microsoft .NET application in C# for web harvesting, web scraping, web data extraction, screen scraping, or whatever you want to call it. For parsing HTML, I'm attempting to incorporate HTML Agility Pack, but it's not as easy as I thought it would be. I have included some specifications and images of what I have so far and was hoping to get your opinions on how I could proceed. Basically, I want to do something similar to the layout used in Visual Web Ripper but I have no ...

Find and remove specified HTML tags using Html Agility Pack

泄露秘密 submitted on 2019-12-08 17:30:19
Question: I'm trying to get Html Agility Pack to work in my case. I need to detect all script elements in an existing HTML page and remove them, saving the changes to another file. Here, bodyNode returns the correct number of script tags, but I can't remove them. The new file still has those tags.
    if (doc.DocumentNode != null)
    {
        var bodyNode = doc.DocumentNode.SelectNodes("//script");
        if (bodyNode != null)
        {
            bodyNode.Clear(); // clears the collection only
        }
        doc.Save("some file");
    }
Answer 1: You need to do ...
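As the comment in the question's code notes, clearing the collection does not touch the document itself. A minimal sketch of one way to detach the nodes instead, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (ScriptStripper, RemoveScripts) are hypothetical and not taken from the truncated answer:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ScriptStripper
{
    // Hypothetical helper: removes every <script> element and returns the resulting HTML.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            // Copy to a list first so removing nodes cannot disturb the iteration.
            foreach (HtmlNode script in scripts.ToList())
            {
                script.Remove(); // detaches the node from its parent in the document
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

To write the result to another file as in the question, doc.Save(path) can be called after the loop instead of returning the string.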

Html Agility Pack help

时光毁灭记忆、已成空白 submitted on 2019-12-08 16:23:47
Question: I'm trying to scrape some information from a website but can't find a solution that works for me. Every piece of code I find on the Internet generates at least one error for me. Even the example code on their homepage generates errors for me. My code:
    HtmlDocument doc = new HtmlDocument();
    doc.Load("https://www.flashback.org/u479804");
    foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
    {
        HtmlAttribute att = link["href"];
        att.Value = FixLink(att);
    }
    doc.Save("file.htm");
Generates ...
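The error text is cut off, so the following is only a sketch of how the same loop is commonly written against current HtmlAgilityPack releases, under these assumptions: HtmlDocument.Load expects a file path or stream rather than a URL (HtmlWeb.Load fetches a URL), the root is exposed as DocumentNode, and attributes are read through Attributes["href"]. The names LinkRewriter, RewriteLinks, and the fixLink delegate are hypothetical stand-ins for the question's undefined FixLink:

```csharp
using System;
using HtmlAgilityPack;

public static class LinkRewriter
{
    // For a remote page, HtmlWeb downloads and parses in one step.
    public static HtmlDocument LoadFromWeb(string url)
    {
        return new HtmlWeb().Load(url);
    }

    // Rewrites every href in the document using the supplied function (hypothetical).
    public static string RewriteLinks(HtmlDocument doc, Func<string, string> fixLink)
    {
        // Note the closing bracket inside the XPath string: //a[@href]
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) // null when the page contains no matching anchors
        {
            foreach (HtmlNode link in links)
            {
                HtmlAttribute att = link.Attributes["href"];
                att.Value = fixLink(att.Value);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call might look like LinkRewriter.RewriteLinks(LinkRewriter.LoadFromWeb(url), href => href.Trim()), followed by doc.Save("file.htm") if the output should go to disk.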

HTML Agility pack create new HTMLNode

China☆狼群 submitted on 2019-12-08 16:14:21
Question: I'm using HTML Agility Pack to parse and transform an HTML file, but I get an exception "Item has already been added" when I try to create a new HtmlNode because of the index parameter.
    HtmlNode node1 = new HtmlNode(HtmlNodeType.Element, doc, 0);
    node1.Name = "div";
    HtmlNode node2 = new HtmlNode(HtmlNodeType.Element, doc, 0);
    node2.Name = "div";
Answer 1: This is how you can create a node (it basically mimics System.Xml semantics, on purpose):
    HtmlNode div = doc.CreateElement("div");
    myNode.Append ...
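The answer's snippet is truncated, so here is a small sketch of the factory-style approach it starts to describe. It assumes the HtmlAgilityPack package; the wrapper names (NodeBuilder, AppendDivs) are hypothetical. It shows two ways to obtain a new element without calling the HtmlNode constructor directly:

```csharp
using HtmlAgilityPack;

public static class NodeBuilder
{
    // Hypothetical helper: appends two new <div> elements to the body (or the root if there is no body).
    public static string AppendDivs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode target = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Option 1: let the document create the element, then fill it in.
        HtmlNode first = doc.CreateElement("div");
        first.InnerHtml = "first";
        target.AppendChild(first);

        // Option 2: build a node from an HTML fragment.
        HtmlNode second = HtmlNode.CreateNode("<div>second</div>");
        target.AppendChild(second);

        return doc.DocumentNode.OuterHtml;
    }
}
```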

how to access child node from node in htmlagility pack

我们两清 submitted on 2019-12-08 14:40:04
Question:
    <html>
    <body>
    <div class="main">
      <div class="submain"><h2></h2><p></p><ul></ul></div>
      <div class="submain"><h2></h2><p></p><ul></ul></div>
    </div>
    </body>
    </html>
I loaded the HTML into an HtmlDocument. Then I selected the submain divs with XPath. Now I don't know how to access each tag, i.e. h2 and p, separately.
    HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class=\"submain\"]");
    foreach (HtmlAgilityPack.HtmlNode node in nodes) {}
If I use node.InnerText I get ...
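One way to reach the children individually is a relative XPath evaluated from each submain node. This is a sketch only, assuming the HtmlAgilityPack package; SubmainReader and Read are hypothetical names. The leading "./" matters: an expression starting with "//" would search the whole document again rather than the current node.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SubmainReader
{
    // Hypothetical helper: returns the h2 and p text of each div.submain.
    public static List<(string Heading, string Paragraph)> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Heading, string Paragraph)>();
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='submain']");
        if (nodes == null)
        {
            return results; // no matching divs
        }

        foreach (HtmlNode node in nodes)
        {
            // "./h2" is evaluated relative to the current submain node.
            HtmlNode h2 = node.SelectSingleNode("./h2");
            HtmlNode p = node.SelectSingleNode("./p");
            results.Add((h2?.InnerText ?? "", p?.InnerText ?? ""));
        }
        return results;
    }
}
```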

Difference between IE 10 rendered HTML and its output HTML

我们两清 submitted on 2019-12-08 13:52:45
Question: I used the IE 10 F12 developer tools to locate an <a> node on my page and got this:
    <a tabindex="-1" class="level1 static" href="About.aspx">About</a>
But when I use the following code to retrieve the page HTML, I get this:
    <a class="level1" href="About.aspx">About</a>
Code:
    WebClient wc = new WebClient();
    String pageString = wc.DownloadString(url);
Why are they different? Update: Below is the Fiddler monitor result. [Screenshots: IE10; WebClient]
Answer 1: It's typical for webservers to send different output depending on which ...
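The answer is cut off before it names what the server keys on. Assuming it is the User-Agent request header (an assumption, not confirmed by the excerpt), a sketch of sending a browser-like value with WebClient could look like this. BrowserLikeClient and Create are hypothetical names, and the user-agent string is supplied by the caller:

```csharp
using System.Net;

public static class BrowserLikeClient
{
    // Hypothetical helper: returns a WebClient that sends the given User-Agent.
    public static WebClient Create(string userAgent)
    {
        var client = new WebClient();
        // Set the header before calling DownloadString.
        client.Headers[HttpRequestHeader.UserAgent] = userAgent;
        return client;
    }
}
```

Even with a matching header the two results may still differ, because the F12 tools show the DOM after the page's scripts have run, while DownloadString returns the markup exactly as the server sent it.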

Fetching google images using htmlagilitypack

让人想犯罪 __ submitted on 2019-12-08 13:32:57
Question: I would like to execute a query on Google Images to fetch images using HtmlAgilityPack in C#. For this I used an XPath expression for the image:
    //*[@id="rg_s"]/div[1]/a/img
But it fails to fetch the image that way. What could be the correct way of doing this?
Answer 1: You can try this too. It gets the links of the images as follows:
    var links = HtmlDocument.DocumentNode.SelectNodes("//a").Where(a => a.InnerHtml.Contains("<img")).Select(b => b.Attributes["href"].Value).ToList();
    foreach ...
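A variation on the answer's idea, offered as a sketch rather than a fix for Google Images specifically: the XPath itself can require both an href attribute and a nested img, which avoids the string search on InnerHtml and the null dereference on anchors without an href. It assumes the HtmlAgilityPack package; ImageLinkFinder and FindLinkedImages are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ImageLinkFinder
{
    // Hypothetical helper: returns the href of every anchor that wraps an <img>.
    public static List<string> FindLinkedImages(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchors that have an href attribute and contain an img at any depth.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href][.//img]");
        if (anchors == null)
        {
            return new List<string>(); // SelectNodes returns null when nothing matches
        }

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }
}
```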

c# read a list of anonymous type occurs an error with a foreach

与世无争的帅哥 submitted on 2019-12-08 12:50:16
Question: I need to get data from this list; however, when I add my foreach I get a runtime error saying that the list contains null objects. But if I remove the foreach and put a breakpoint on the line where listbox.ItemsSource receives the list, I see that the list is loaded with all items correctly.
    var imgs = e.Document.DocumentNode.SelectNodes(@"//img[@src]")
        .Select(img => new
        {
            Link = img.Attributes["src"].Value,
            Title = img.Attributes["alt"].Value,
        }).ToList();
    listBoxPopular.ItemsSource = imgs ...
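The excerpt does not show the exact exception, so the cause here is an assumption: the XPath guarantees a src attribute but not an alt attribute, and Attributes["alt"] yields null for images without one. A sketch that reads both attributes with a fallback value, assuming the HtmlAgilityPack package (ImageInfoReader and Read are hypothetical names):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ImageInfoReader
{
    // Hypothetical helper: returns (Link, Title) pairs for every img that has a src.
    public static List<(string Link, string Title)> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
        {
            return new List<(string Link, string Title)>();
        }

        // GetAttributeValue returns the fallback instead of failing when the attribute is missing.
        return images
            .Select(img => (Link: img.GetAttributeValue("src", ""),
                            Title: img.GetAttributeValue("alt", "")))
            .ToList();
    }
}
```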

Select elements added to the DOM by a script

隐身守侯 submitted on 2019-12-08 12:13:03
Question: I've been trying to get either an <object> or an <embed> tag using:
    HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
    HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work. Can anyone please tell me how to get these tags and their InnerHtml? A YouTube embedded video looks like this:
    <embed height="385" width="640" type="application/x-shockwave-flash" src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" ...

Parsing form with HTML Agility Pack

牧云@^-^@ submitted on 2019-12-08 11:54:08
Question: I'm trying to extract all input elements from a form. When I parse the following form:
    <form>
      <input name='test1' type='text'>
      <input name='test2' type='text'>
      <input name='test3' type='text'>
    </form>
everything works perfectly and HTML Agility Pack is able to detect the input elements in the form. But if the inputs have a div parent node like the following, they are not detected.
    <form>
      <div><input name='test1' type='text'></div>
      <div><input name='test2' type='text'></div>
      <div><input name='test3' ...
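The question is truncated and no answer is shown, so this is only a sketch of a commonly cited workaround. It assumes the behaviour comes from the parser's special handling of the form element, which is controlled by the static HtmlNode.ElementsFlags dictionary; removing the "form" entry makes the parser treat form like an ordinary container. FormInputReader and CountInputs are hypothetical names.

```csharp
using HtmlAgilityPack;

public static class FormInputReader
{
    // Hypothetical helper: counts the input elements nested anywhere inside a form.
    public static int CountInputs(string html)
    {
        // Global, static setting: affects every document parsed afterwards.
        // Remove simply returns false if the entry is not present.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//form//input" matches inputs at any depth below a form element.
        var inputs = doc.DocumentNode.SelectNodes("//form//input");
        return inputs?.Count ?? 0;
    }
}
```

If changing a process-wide setting is undesirable, selecting "//input" from the document root sidesteps the question of how the form element was nested.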