html-agility-pack

HTML agility pack - removing unwanted tags without removing content?

北战南征 posted on 2019-11-26 22:06:34
I've seen a few related questions here, but they don't address exactly the problem I'm facing. I want to use the HTML Agility Pack to remove unwanted tags from my HTML without losing the content inside them. In my scenario, I would like to preserve the tags "b", "i" and "u". For an input like:

    <p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>

the resulting HTML should be:

    my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>

I tried using HtmlNode's Remove method, but it removes my content too. Any
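
One way to approach this, shown here as a minimal sketch and not as the answer from the original thread, is to unwrap each unwanted element with `HtmlNode.RemoveChild(node, keepGrandChildren: true)`, which removes the element but moves its children up to the parent. The class name `TagStripper` and the method `StripTags` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagStripper
{
    // Hypothetical helper (not from the original post): unwraps every element
    // whose name is not in allowedTags, keeping its children in place.
    public static string StripTags(string html, ISet<string> allowedTags)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the elements first, because the tree is modified in the loop.
        // Reverse() puts nested elements ahead of their ancestors.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element && !allowedTags.Contains(n.Name))
            .Reverse()
            .ToList();

        foreach (var node in unwanted)
        {
            // keepGrandChildren = true re-parents the node's children
            // to the node's parent before the node itself is removed.
            node.ParentNode.RemoveChild(node, true);
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "u" };
        Console.WriteLine(StripTags("<div>and my <b>div</b></div>", allowed));
    }
}
```

Text nodes are left untouched because only nodes of type `HtmlNodeType.Element` are selected. How the parser nests invalid markup (such as a `div` inside a `p`) can differ between HtmlAgilityPack versions, so the exact output for the input in the question is worth checking against the version in use.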

HtmlAgilityPack.HtmlNode no definition for SelectNodes

自作多情 posted on 2019-11-26 22:03:43
Question: I am trying to use the HtmlAgilityPack to find elements within a website. My problem is the following: I created a Windows 8 universal app (C#). With the NuGet Manager I added:

    using System.Net.Http;
    using HtmlAgilityPack;

Then I did:

    string htmlPage;
    using (var client = new HttpClient())
    {
        htmlPage = await client.GetStringAsync("http://www.domain.de/");
    }
    HtmlDocument myDocument = new HtmlDocument();
    myDocument.LoadHtml(htmlPage);
    // this line results in an error at "SelectNodes"
    var metaTags =
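
The excerpt does not show the cause of the error. Assuming the HtmlAgilityPack build referenced by the app simply does not expose the XPath methods (an assumption, not something the post confirms), a possible workaround is to query the tree with LINQ through `Descendants`. The names `MetaTagReader` and `GetMetaTags` below are hypothetical, and the sample markup is made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class MetaTagReader
{
    // Hypothetical helper: collects all <meta> elements without using XPath.
    public static List<HtmlNode> GetMetaTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants(name) walks the whole tree and filters by element name.
        return doc.DocumentNode.Descendants("meta").ToList();
    }

    public static void Main()
    {
        // Made-up sample page, used only to exercise the helper.
        const string sample =
            "<html><head><meta name=\"description\" content=\"demo\">" +
            "<meta charset=\"utf-8\"></head><body></body></html>";

        foreach (var meta in GetMetaTags(sample))
        {
            // GetAttributeValue returns the fallback when the attribute is missing.
            Console.WriteLine(meta.GetAttributeValue("name", "(no name)"));
        }
    }
}
```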

Parsing HTML Reading Option Tag Content with HtmlAgilityPack

假装没事ソ posted on 2019-11-26 21:57:37
Question: I am trying to use HtmlAgilityPack to parse HTML, but am having problems. Sample HTML document:

    <tr>
      <td class="css_lokalita" colspan="4">
        <select id="region" name="region">
          <option value="0" selected>Všetky regiony</option>
          <optgroup>Banskobystrický kraj</optgroup>
          <option value="k_1" style="color: #000000; font-weight:bold;">Banskobystrický kraj</option>
          <option value="1">   Banská Bystrica</option>
          . . .
          <option value="174">   CZ - Ústecký kraj</option>
          <option value="175">   CZ - Zlínský kraj<
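
The excerpt stops before the actual problem is described. A frequently reported cause with `option` elements is that some HtmlAgilityPack versions register `option` as an empty element, so its text ends up outside the node. Assuming that is the issue here, the sketch below removes that flag before parsing and then reads each option's value and text. `OptionReader` and `ReadOptions` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class OptionReader
{
    // Hypothetical helper: returns (value, text) pairs for the options of a <select>.
    public static List<KeyValuePair<string, string>> ReadOptions(string html, string selectId)
    {
        // Assumption: in the version in use, "option" may be flagged as an empty
        // element. Removing a key that is not present is harmless.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();
        var options = doc.DocumentNode.SelectNodes("//select[@id='" + selectId + "']/option");
        if (options == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (var option in options)
        {
            string value = option.GetAttributeValue("value", string.Empty);
            // DeEntitize turns entities such as &nbsp; into characters before trimming.
            string text = HtmlEntity.DeEntitize(option.InnerText).Trim();
            result.Add(new KeyValuePair<string, string>(value, text));
        }
        return result;
    }

    public static void Main()
    {
        const string sample =
            "<select id=\"region\" name=\"region\">" +
            "<option value=\"0\" selected>Všetky regiony</option>" +
            "<option value=\"1\">Banská Bystrica</option></select>";

        foreach (var pair in ReadOptions(sample, "region"))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

The flag change is global (`ElementsFlags` is static), so it affects every document parsed afterwards in the same process.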

HTML Agility Pack HtmlDocument Show All Html?

ぐ巨炮叔叔 posted on 2019-11-26 21:28:58
Question: I am using the following to get a web page, which works fine:

    public static HtmlDocument GetWebPageFromUrl(string url)
    {
        var hw = new HtmlWeb();
        return hw.Load(url);
    }

But how do I output the entire contents of the HTML from the HtmlDocument into a string? I tried HtmlDocument.ToString(), but that doesn't give me all the HTML in the document. Any ideas?

Answer 1: DocumentNode.OuterHtml contains the full HTML:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load("sample
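
The idea from the answer can be sketched as follows. To keep the sketch self-contained it parses an in-memory string with `LoadHtml` and does not load the file from the truncated answer; the markup and the names `HtmlDumper` and `ToHtmlString` are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlDumper
{
    // Hypothetical helper: returns the markup of the whole document as a string.
    public static string ToHtmlString(HtmlDocument doc)
    {
        // DocumentNode is the root of the parsed tree; OuterHtml serializes
        // it together with all of its descendants.
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample markup, parsed from memory so no file or network is needed.
        doc.LoadHtml("<html><body><p>hello</p></body></html>");

        Console.WriteLine(ToHtmlString(doc));
    }
}
```

The same property works on a document returned by `HtmlWeb.Load(url)`, as in the `GetWebPageFromUrl` method from the question.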

Scraping a webpage with C# and HTMLAgility

﹥>﹥吖頭↗ posted on 2019-11-26 21:27:05
Question: I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. Being a new programmer, I am hoping I could get some input on this project. I am doing this as a C# form application. The page I am working with is fairly straightforward. The information I need is stuck between just 2 tags and . My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, Last Modified By out of the page and send the data to a SQL table. One twist is that there is

Grab all text from html with Html Agility Pack

早过忘川 posted on 2019-11-26 20:26:15
Input:

    <html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

Output:

    foo bar baz

I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text, not all at a time.

    var root = doc.DocumentNode;
    var sb = new StringBuilder();
    foreach (var node in root.DescendantNodesAndSelf())
    {
        if (!node.HasChildNodes)
        {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
                sb.AppendLine(text.Trim());
        }
    }

This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than

Parsing HTML page with HtmlAgilityPack

谁都会走 posted on 2019-11-26 19:56:21
Using C#, I would like to know how to get the textbox value (i.e., John) from this sample HTML:

    <TD class=texte width="50%"> <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width="50%"><INPUT class=box value=John maxLength=16 size=16 name=user_name> </TD>
    <TR vAlign=center>

gpmcadam

There are a number of ways to select elements using the agility pack. Let's assume we have defined our HtmlDocument as follows:

    string html = @"<TD class=texte width=""50%"">
    <DIV align=right>Name :<B> </B></DIV></TD>
    <TD width=""50%"">
    <INPUT class=box value=John maxLength=16 size=16 name=user_name>
    </TD>
    <TR
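
As one of those ways, here is a minimal sketch that selects the input by its `name` attribute with XPath and reads its `value` attribute. It is an illustration under the assumption that the `name` is unique on the page, and it is not necessarily the approach the truncated answer goes on to show. `TextboxReader` and `GetInputValue` are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

public static class TextboxReader
{
    // Hypothetical helper: returns the value attribute of the <input> with the given name,
    // or null when no such element exists.
    public static string GetInputValue(string html, string inputName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Element and attribute names are matched in lower case here,
        // even though the source markup uses <INPUT ...>.
        var input = doc.DocumentNode.SelectSingleNode("//input[@name='" + inputName + "']");

        // SelectSingleNode returns null when nothing matches.
        return input == null ? null : input.GetAttributeValue("value", string.Empty);
    }

    public static void Main()
    {
        const string html =
            "<TD width=\"50%\"><INPUT class=box value=John maxLength=16 size=16 name=user_name> </TD>";

        Console.WriteLine(GetInputValue(html, "user_name"));
    }
}
```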

Html Agility Pack. Load and scrape webpage

旧城冷巷雨未停 posted on 2019-11-26 19:50:14
Question: Is this the best way to get a webpage when scraping?

    HttpWebRequest oReq = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse resp = (HttpWebResponse)oReq.GetResponse();
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(resp.GetResponseStream());
    var element = doc.GetElementbyId("//start-left");
    var element2 = doc.DocumentNode.SelectSingleNode("//body");
    string html = doc.DocumentNode.OuterHtml;

I've seen HtmlWeb().Load to get a webpage. Is that a better alternative to load and the

How to select node types which are HtmlNodeType.Comment using HTMLAgilityPack

我与影子孤独终老i posted on 2019-11-26 18:33:43
Question: I wish to remove from HTML things like:

    <!--[if gte mso 9]> ... <![endif]-->
    <!--[if gte mso 10]> ... <![endif]-->

How can I do this in C# using HTMLAgilityPack? For normal tags I'm using:

    static void RemoveTag(HtmlNode node, string tag)
    {
        var nodeCollection = node.SelectNodes("//" + tag);
        if (nodeCollection != null)
            foreach (HtmlNode nodeTag in nodeCollection)
            {
                nodeTag.Remove();
            }
    }

Answer 1:

    public static void RemoveComments(HtmlNode node)
    {
        foreach (var n in node.ChildNodes.ToArray())
            RemoveComments
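
The answer above is cut off, so what follows is a separate, non-recursive sketch of the same idea and not a reconstruction of it: collect every node whose `NodeType` is `HtmlNodeType.Comment`, then remove each one. `CommentStripper` and `StripComments` are hypothetical names. The `<!--` check is there on the assumption that the parser may also report a `<!DOCTYPE ...>` declaration as a comment node, which you would normally want to keep.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper: removes all comment nodes below root,
    // including conditional comments such as <!--[if gte mso 9]> ... <![endif]-->.
    public static void StripComments(HtmlNode root)
    {
        // ToList() takes a snapshot, because nodes are removed during the loop.
        var comments = root.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment)
            // Assumption: a doctype may show up as a comment node; keep anything
            // that does not start with the regular comment opener.
            .Where(n => n.OuterHtml.StartsWith("<!--", StringComparison.Ordinal))
            .ToList();

        foreach (var comment in comments)
        {
            comment.Remove();
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>a<!--[if gte mso 9]>x<![endif]-->b</p>");

        StripComments(doc.DocumentNode);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```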

SelectNodes with XPath ignoring cases

这一生的挚爱 posted on 2019-11-26 17:12:53
Question: I have a problem finding elements with XPath that contain a certain string, ignoring character casing. I want to find all the nodes in an HTML page whose id contains the text "footer", regardless of whether it is written in uppercase or lowercase. In my example I have HTML like this:

    <div id="footer">some text</div>
    <div id="anotherfooter">some text</div>
    <div id="AnotherFooter">some text</div>
    <div id="AnotherFooterAgain">some text</div>

I need to select all nodes (or any combination in any
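
XPath 1.0 has no lower-case function, so a common workaround is to lower-case the attribute with `translate()` before calling `contains()`. The sketch below applies that to the `id` attribute; it only folds the ASCII letters A to Z, and it assumes the search fragment contains no quote characters. `FooterFinder` and `FindByIdFragment` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FooterFinder
{
    private const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private const string Lower = "abcdefghijklmnopqrstuvwxyz";

    // Hypothetical helper: returns all elements whose id contains the fragment,
    // compared case-insensitively for ASCII letters.
    public static List<HtmlNode> FindByIdFragment(string html, string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // translate() maps each uppercase letter of @id to its lowercase counterpart.
        // Assumption: fragment contains no single quotes, so it can be embedded directly.
        string xpath = "//*[contains(translate(@id, '" + Upper + "', '" + Lower + "'), '"
                       + fragment.ToLowerInvariant() + "')]";

        var nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null when nothing matches.
        return nodes == null ? new List<HtmlNode>() : nodes.ToList();
    }

    public static void Main()
    {
        const string html =
            "<div id=\"footer\">some text</div>" +
            "<div id=\"anotherfooter\">some text</div>" +
            "<div id=\"AnotherFooter\">some text</div>" +
            "<div id=\"AnotherFooterAgain\">some text</div>";

        foreach (var node in FindByIdFragment(html, "Footer"))
        {
            Console.WriteLine(node.GetAttributeValue("id", string.Empty));
        }
    }
}
```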