html-agility-pack | 易学教程

Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

Question: I am working on a Microsoft .NET application in C# for web harvesting, web scraping, web data extraction, screen scraping, or whatever you want to call it. For parsing HTML, I'm attempting to incorporate HTML Agility Pack, but it's not as easy as I thought it would be. I have included some specifications and images of what I have so far and was hoping to get your opinions on how I could proceed. Basically, I want to do something similar to the layout used in Visual Web Ripper but I have no

Find and remove specified HTML tags using Html Agility Pack

Question: I'm trying to get Html Agility Pack to work in my case. I need to detect all script elements in an existing HTML page and remove them, saving the changes to another file. Here, bodyNode returns the correct number of script tags, but I can't remove them. The new file still has those tags.

    if (doc.DocumentNode != null)
    {
        var bodyNode = doc.DocumentNode.SelectNodes("//script");
        if (bodyNode != null)
        {
            bodyNode.Clear(); // clears the collection only
        }
        doc.Save("some file");
    }

Answer 1: You need to do
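The answer excerpt is cut off, so the following is only a minimal sketch of one way to do this, not necessarily what the answer goes on to say. As the comment in the question notes, Clear() only empties the returned collection, so the sketch detaches each node from the document with HtmlNode.Remove() instead. The class name ScriptStripper and the method RemoveScripts are hypothetical names chosen for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative and not part of Html Agility Pack.
public static class ScriptStripper
{
    // Returns the given HTML with every <script> element removed.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            // Copy to a list first, then detach each node from its parent.
            foreach (var script in scripts.ToList())
            {
                script.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

To write the result to a file as in the question, doc.Save(path) can be called after the loop instead of returning OuterHtml.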

Html Agility Pack help

Question: I'm trying to scrape some information from a website but can't find a solution that works for me. Every piece of code I find on the Internet generates at least one error for me. Even the example code on their homepage generates errors for me. My code:

    HtmlDocument doc = new HtmlDocument();
    doc.Load("https://www.flashback.org/u479804");
    foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
    {
        HtmlAttribute att = link["href"];
        att.Value = FixLink(att);
    }
    doc.Save("file.htm");

Generates
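The error text is cut off in this excerpt, so the following is a hedged sketch of how the same loop might be written against current Html Agility Pack releases, not a confirmed fix for the asker's error. It assumes the HTML is already available as a string; HtmlDocument.Load(string) treats its argument as a file path, so fetching from a URL would instead go through something like new HtmlWeb().Load(url). The sketch uses DocumentNode and Attributes["href"], closes the XPath predicate as "//a[@href]", and stands in for the asker's undefined FixLink with a caller-supplied delegate. LinkRewriter and RewriteLinks are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper; fixLink stands in for the asker's FixLink, which is not shown.
public static class LinkRewriter
{
    // Applies fixLink to the href of every <a> element and returns the new HTML.
    public static string RewriteLinks(string html, Func<string, string> fixLink)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Note the closing bracket inside the string: "//a[@href]".
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) // null when the page has no matching links
        {
            foreach (HtmlNode link in links)
            {
                HtmlAttribute att = link.Attributes["href"];
                att.Value = fixLink(att.Value);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

For example, RewriteLinks(html, href => "https://example.com" + href) would prefix every link with a placeholder host.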

HTML Agility pack create new HTMLNode

Question: I'm using HTML Agility Pack to parse and transform an HTML file, but I get an exception "Item has already been added" when I try to create a new HtmlNode because of the index parameter.

    HtmlNode node1 = new HtmlNode(HtmlNodeType.Element, doc, 0);
    node1.Name = "div";
    HtmlNode node2 = new HtmlNode(HtmlNodeType.Element, doc, 0);
    node2.Name = "div";

Answer 1: This is how you can create a node (it basically mimics System.Xml semantics, on purpose): HtmlNode div = doc.CreateElement("div"); myNode.Append
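A small sketch of the CreateElement approach named in the answer, under the assumption that the new nodes should be attached to the document's body. The answer's own code is cut off after "myNode.Append", so the use of AppendChild here is this sketch's choice and may differ from how the answer continues. NodeBuilder and AppendTwoDivs are hypothetical names.

```csharp
using HtmlAgilityPack;

// Hypothetical helper showing node creation through the owning document.
public static class NodeBuilder
{
    // Appends two empty <div> elements to <body> (or to the document root
    // when there is no <body>) and returns the resulting HTML.
    public static string AppendTwoDivs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode parent = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        // Let the document create the nodes instead of calling the
        // HtmlNode constructor with an explicit index.
        HtmlNode first = doc.CreateElement("div");
        HtmlNode second = doc.CreateElement("div");

        parent.AppendChild(first);
        parent.AppendChild(second);

        return doc.DocumentNode.OuterHtml;
    }
}
```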

how to access child node from node in htmlagility pack

Question:

    <html>
    <body>
    <div class="main">
      <div class="submain"><h2></h2><p></p><ul></ul> </div>
      <div class="submain"><h2></h2><p></p><ul></ul> </div>
    </div>
    </body>
    </html>

I loaded the HTML into an HtmlDocument. I then selected the submain divs with XPath, but I don't know how to access each tag (h2, p) separately.

    HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[@class=\"submain\"]");
    foreach (HtmlAgilityPack.HtmlNode node in nodes) {}

If I use node.InnerText I get
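One way to reach the individual children, sketched under the assumption that h2 and p are direct children of each submain div as in the markup above: run a relative XPath query from each selected node. SubmainReader and Read are hypothetical names.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper; returns the h2 and p text of every "submain" div.
public static class SubmainReader
{
    public static List<(string Heading, string Paragraph)> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Heading, string Paragraph)>();
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='submain']");
        if (nodes == null)
        {
            return results; // no matching divs
        }

        foreach (HtmlNode node in nodes)
        {
            // "./h2" is evaluated relative to the current div.
            // A leading "//" would search the whole document instead.
            HtmlNode h2 = node.SelectSingleNode("./h2");
            HtmlNode p = node.SelectSingleNode("./p");
            results.Add((h2?.InnerText ?? string.Empty, p?.InnerText ?? string.Empty));
        }

        return results;
    }
}
```

If the tags can be nested more deeply, ".//h2" selects descendants of the current node rather than only direct children.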

Difference between IE 10 rendered HTML and its output HTML

Question: I used the IE 10 F12 developer tools to locate an <a> node on my page and got this:

    <a tabindex="-1" class="level1 static" href="About.aspx">About</a>

But when I use the following code to retrieve the page HTML, I get this:

    <a class="level1" href="About.aspx">About</a>

Code:

    WebClient wc = new WebClient();
    String pageString = wc.DownloadString(url);

Why are they different? Update: Below is the Fiddler monitor result. IE10: WebClient:

Answer 1: It's typical for webservers to send different output depending on which

Fetching google images using htmlagilitypack

Question: I would like to execute a query on Google Images to fetch images using HtmlAgilityPack in C#. For this I used an XPath request for the image:

    //*[@id="rg_s"]/div[1]/a/img

But it fails to fetch the image that way. What would be the correct way of doing this?

Answer 1: You can try this too. Here it's possible to get the links of the images as follows:

    var links = HtmlDocument.DocumentNode.SelectNodes("//a").Where(a => a.InnerHtml.Contains("<img")).Select(b => b.Attributes["href"].Value).ToList();
    foreach

c# read a list of anonymous type occurs an error with a foreach

Question: I need to get data from this list. However, when I add my foreach I get a runtime error saying that the list contains null objects. But if I remove the foreach and put a breakpoint on the line where listbox.itemsSource receives the list, I see that the list is loaded with all items correctly.

    var imgs = e.Document.DocumentNode.SelectNodes(@"//img[@src]")
        .Select(img => new
        {
            Link = img.Attributes["src"].Value,
            Title = img.Attributes["alt"].Value,
        }).ToList();
    listBoxPopular.ItemsSource = imgs
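The excerpt does not show the cause of the error. One plausible cause, stated here purely as an assumption, is that some img elements have no alt attribute: Attributes["alt"] then returns null and reading .Value throws. The sketch below sidesteps that with GetAttributeValue, which takes a default, and also guards against SelectNodes returning null. ImageLister and Collect are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; tolerates <img> elements that lack an alt attribute.
public static class ImageLister
{
    public static List<(string Link, string Title)> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var imgs = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgs == null)
        {
            // SelectNodes returns null when nothing matches.
            return new List<(string Link, string Title)>();
        }

        return imgs
            .Select(img => (
                Link: img.GetAttributeValue("src", string.Empty),
                // Falls back to an empty string when alt is missing.
                Title: img.GetAttributeValue("alt", string.Empty)))
            .ToList();
    }
}
```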

Select elements added to the DOM by a script

Question: I've been trying to get either an <object> or an <embed> tag using:

    HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
    HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");

This doesn't seem to work. Can anyone please tell me how to get these tags and their InnerHtml? A YouTube embedded video looks like this:

    <embed height="385" width="640" type="application/x-shockwave-flash" src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player"

Parsing form with HTML Agility Pack

Question: I'm trying to extract all input elements from a form. When I parse the following form:

    <form>
      <input name='test1' type='text'>
      <input name='test2' type='text'>
      <input name='test3' type='text'>
    </form>

everything works perfectly and HTML Agility Pack is able to detect the input elements in the form. But if the inputs have a div parent node, as in the following, they are not detected:

    <form>
      <div><input name='test1' type='text'></div>
      <div><input name='test2' type='text'></div>
      <div><input name='test3'
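No answer is shown in this excerpt. A commonly cited workaround, offered here as an assumption and not as the confirmed resolution of this question, is that Html Agility Pack applies special parsing flags to form elements, which can leave the form without the children one would expect. Removing the "form" entry from HtmlNode.ElementsFlags before loading makes the parser treat form like an ordinary container. This setting is static, so it affects every document parsed afterwards in the process. FormInputs and CountInputs are hypothetical names.

```csharp
using HtmlAgilityPack;

// Hypothetical helper; counts the <input> descendants of the first <form>.
public static class FormInputs
{
    public static int CountInputs(string html)
    {
        // Assumption: dropping the special-case flags for "form" lets it keep
        // its child nodes. This is process-wide state, so set it once at startup.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");

        // ".//input" matches inputs at any depth below the form,
        // including those wrapped in a div.
        var inputs = form?.SelectNodes(".//input");
        return inputs?.Count ?? 0;
    }
}
```

If changing global parser flags is undesirable, selecting "//input" from doc.DocumentNode finds the inputs regardless of how the form itself was parsed, at the cost of not scoping the search to one form.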