html-agility-pack

XPath selecting between comments multiple times

孤人 submitted on 2019-12-11 11:35:44
Question: We need to read the nodes between HTML comments: <html> <!-- comment 1 --> <div>some text</div> <div><p>Some more elements</p></div> <!-- end content --> <!-- comment 2 --> <div>some text</div> <div><p>Some more elements</p> <!-- end content --> </div> </html> I tried using the XPath below: //*[preceding-sibling::comment()[contains(., 'comment 1')]][following-sibling::comment()[contains(., 'end content')]] It works fine for the first comment, i.e. comment 1, but does not work for the second comment following
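One possible workaround, sketched here rather than taken from the original thread, is to skip the pure-XPath range and walk the siblings in code: locate the start comment, then collect element siblings until the next sibling comment or the end of the parent. This also covers the second block in the sample, where the closing comment sits inside the last div and is therefore not a sibling. The class and method names are hypothetical, and the marker is assumed to contain no single quote.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CommentSections
{
    // Returns the element siblings that follow the first comment containing
    // `marker`, stopping at the next sibling comment or the end of the parent.
    // Assumption: `marker` contains no single quote (it is spliced into XPath).
    public static List<HtmlNode> ElementsAfterComment(string html, string marker)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<HtmlNode>();
        var start = doc.DocumentNode.SelectSingleNode(
            "//comment()[contains(., '" + marker + "')]");
        if (start == null) return result; // marker comment not found

        for (var node = start.NextSibling; node != null; node = node.NextSibling)
        {
            if (node.NodeType == HtmlNodeType.Comment) break;        // next section starts
            if (node.NodeType == HtmlNodeType.Element) result.Add(node); // skip text nodes
        }
        return result;
    }
}
```

Calling ElementsAfterComment(html, "comment 2") on the sample markup should yield the two div elements of the second block.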

HTMLAgilityPack, HTML duplicate IDs

我与影子孤独终老i submitted on 2019-12-11 10:35:39
Question: Hi: This is similar to this one here, but it needs to be done at the server level rather than at the client level. Currently I use HTMLAgilityPack; is there any way I could detect duplicate IDs? Thanks in advance. Answer 1: Here's a quick way to do it: HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(htmlString); var count = new Dictionary<string, int>(); foreach (var node in doc.DocumentNode.Descendants()) { string id = node.GetAttributeValue("id", null); if (id != null) { if (count.ContainsKey(id))
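The answer above is cut off mid-loop. As an independent sketch of the same idea (not a reconstruction of the answerer's code), duplicate IDs can also be found by grouping the id values with LINQ. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DuplicateIds
{
    // Returns every id value that appears on more than one node.
    public static List<string> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            .Select(n => n.GetAttributeValue("id", (string)null)) // null when no id attribute
            .Where(id => id != null)
            .GroupBy(id => id)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();
    }
}
```

The comparison is case-sensitive, which matches how HTML treats id values.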

Scraping HTML from Google Translate

僤鯓⒐⒋嵵緔 submitted on 2019-12-11 10:15:55
Question: I want to translate a string using Google Translator. My sample string is "this is my string". I want to use HTML Agility Pack to parse the HTML document. I tried this: using HtmlAgilityPack; ........ var webGet = new HtmlWeb(); var document = webGet.Load( "http://translate.google.com/#en/bn/this%20is%20my%20string"); var node = document.DocumentNode.SelectNodes( "//span[@class='short_text' and @id='result_box']"); if (node != null) { foreach (var xx in node) { x = xx.InnerText; MessageBox.Show

C# HTML Agility Pack Single Select Node returning null

♀尐吖头ヾ submitted on 2019-12-11 08:56:08
Question: I have a web scraper developed using C#, Windows Forms and the HTML Agility Pack. I had it all working great until the site changed its code and broke it. I know this happens often with web scrapers, but now I am having trouble figuring out how to correct the issue. At this time my scraper loops through multiple URLs and scrapes data from each page. The problem I am running into is that the template of the site it loops through will randomly show the newer template, which does not have the same HTML

select divs and put into collection using htmlagilitypack not working

大兔子大兔子 submitted on 2019-12-11 08:28:43
Question: Why does this not work? I get a null reference exception on the foreach loop as it starts. I'm trying to get the text of all the divs on a page and put each one into my own collection. Imports HtmlAgilityPack Imports System.Xml Partial Class _Default Inherits System.Web.UI.Page Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load Dim webGet As HtmlWeb = New HtmlWeb Dim htmlDoc As HtmlDocument = webGet.Load("http://www.mysite.com") Dim ids As New List(Of
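The question's code is truncated, so the actual cause cannot be confirmed from this excerpt. One frequent source of this symptom, though, is that SelectNodes returns null rather than an empty collection when nothing matches, so iterating the result directly throws. A minimal sketch of the guard, written in C# rather than the question's VB.NET and using hypothetical names:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DivTexts
{
    // Collects the trimmed inner text of every div, or an empty list if there are none.
    public static List<string> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        var divs = doc.DocumentNode.SelectNodes("//div");
        if (divs == null) return texts; // SelectNodes yields null when there is no match

        foreach (var div in divs)
            texts.Add(div.InnerText.Trim());
        return texts;
    }
}
```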

Can't figure how to parse using HTML Agility Pack

让人想犯罪 __ submitted on 2019-12-11 08:02:36
Question: I have the following chunk of HTML code, but I can't figure out how to get the designated values <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> </head> <body > <form name="form1" method="post" action="" id="form1"> <div> <table class="tableclass" > <tbody> <tr> <tr> <td colspan="5" class="myclass1"><span id="myclass2">value1</span></td> </tr> <tr id="idvalue" aa="1" class

Ignoring parse errors with Html Agility Pack?

◇◆丶佛笑我妖孽 submitted on 2019-12-11 07:27:46
Question: I'm trying to parse a single page from YouTube, which isn't really free of syntax errors. Html Agility Pack screams about these errors and returns nothing as a result. http://codepaste.net/gh3hco Answer 1: I haven't tried this, but based on a suggestion in their forum you can use HTML Tidy or Tidy.NET to clean the HTML first. Alternatively, you could find the erroneous tags and remove them in a pre-processing step. Source: https://stackoverflow.com/questions/6182404/ignoring-parse-errors-with-html
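Separately from the Tidy suggestion, it may help to check whether the errors are actually fatal. The sketch below assumes that HtmlDocument records problems in its ParseErrors collection while still building a document tree, so the tree can be queried and the errors merely logged. Whether this resolves the questioner's case is unknown, since their code is only linked. The names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LenientParse
{
    // Returns the <title> text and reports parse errors through `errors`
    // instead of treating them as a failure.
    public static string TitleOf(string html, out List<string> errors)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        errors = doc.ParseErrors
            .Select(e => $"line {e.Line}, col {e.LinePosition}: {e.Reason}")
            .ToList();

        var title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? null : title.InnerText;
    }
}
```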

Html Agility Pack nextsibling not finding element if there are white spaces between tags

房东的猫 submitted on 2019-12-11 07:15:59
Question: I'm trying to find the InnerText of the NextSibling of a specific tag, but I can't get the proper value when I parse an HTML string that contains white space or new lines. Here is my code: private string getTableTdValue() { /*If I strip all white space between tag I get proper results string myHtml = "<td align='right' width='186'>Text1</td><td align='center' width='51'>Here the result I want to get</td><td width='186'>Text2</td>";*/ string myHtml = @" <td align='right' width='186'>Text1</td
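The behaviour described is consistent with how the parser represents white space: the gap between two tags becomes a text node, and NextSibling returns that text node rather than the next td. A small sketch of a helper that skips non-element siblings follows; the helper name is hypothetical.

```csharp
using HtmlAgilityPack;

public static class Siblings
{
    // Returns the next sibling that is an element, skipping text and comment
    // nodes, or null when there is none.
    public static HtmlNode NextElement(HtmlNode node)
    {
        var next = node.NextSibling;
        while (next != null && next.NodeType != HtmlNodeType.Element)
            next = next.NextSibling;
        return next;
    }
}
```

An XPath such as following-sibling::td[1], evaluated from the starting node, is another way to express the same step.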

HtmlAgilityPack-PCL + LINQ

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 07:13:35
Question: Well, basically I have a Windows Phone 8.1 app that's supposed to download the HTML file and parse it using HtmlAgilityPack-PCL and LINQ. var nodes = from tr in doc.DocumentNode.Descendants("body") from td in tr.Descendants("div").Where(x => x.Attributes["id"].Value == "screen")select tr; Then I'm trying to get the node from nodes: HtmlNode node = nodes.FirstOrDefault(); And this is the point where I get an exception: "Object reference not set to an instance of an object." The HTML file
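The rest of the question is cut off, so the following is only one plausible explanation: x.Attributes["id"] is null for any div without an id attribute, and reading .Value on it throws once the lazy query is enumerated by FirstOrDefault(). A sketch of a null-safe lookup using GetAttributeValue, with hypothetical names:

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SafeLookup
{
    // Returns the first div whose id equals `id`, or null when there is none.
    // GetAttributeValue falls back to "" for divs that have no id attribute.
    public static HtmlNode FindDivById(HtmlDocument doc, string id)
    {
        return doc.DocumentNode
            .Descendants("div")
            .FirstOrDefault(d => d.GetAttributeValue("id", "") == id);
    }
}
```

This relies only on LINQ over Descendants, not on XPath.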

HTML Agility Pack - Select node after particular paragraph

别说谁变了你拦得住时间么 submitted on 2019-12-11 05:29:33
Question: I have this kind of situation: various files with the following HTML. I need to retrieve only the list after the "targetWord" paragraph (of course, its position changes in the pages I need to parse). How can I do this with HTML Agility Pack? <p>Word1</p> <ul> <li>listobject1</li> <li>listobject2</li> <li>listobject3</li> </ul> <p>targetWord</p> <ul> <li>listobject4</li> <li>listobject5</li> <li>listobject6</li> </ul> <p>Word2</p> <ul> <li>listobject7</li> <li>listobject8</li> <li>listobject9</li> </ul>
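One way to approach this, offered as a sketch rather than the thread's accepted answer, is an XPath that finds the paragraph by its text and then takes the first ul among its following siblings. The names are hypothetical, and the word is assumed to contain no single quote.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ListAfterParagraph
{
    // Returns the text of each li in the first ul that follows the paragraph
    // whose text equals `word`. Assumption: `word` contains no single quote.
    public static List<string> Items(string html, string word)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(
            "//p[normalize-space()='" + word + "']/following-sibling::ul[1]/li");
        if (nodes == null) return items; // paragraph or list not present

        foreach (var li in nodes)
            items.Add(li.InnerText.Trim());
        return items;
    }
}
```

Because the paragraph is matched by text rather than by position, the lookup keeps working when the target moves within the page.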