I have read that HtmlAgilityPack 1.4 is a great solution for scraping a webpage. As a new programmer, I am hoping to get some input on this project. I am doing this as a c
The beginning part is off:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml("http://localhost");

LoadHtml(html) loads an HTML string into the document; it does not download anything from a URL. I think you want something like this instead:

    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlDocument doc = htmlWeb.Load("http://stackoverflow.com");
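
To make the difference concrete, here is a minimal sketch that only uses LoadHtml on literal strings, so it needs no network access. The class name LoadHtmlDemo and the sample markup are made up for illustration; the only assumption is that the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class LoadHtmlDemo
{
    static void Main()
    {
        // LoadHtml parses its argument as markup; it never makes a request.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p id='greeting'>Hello</p></body></html>");
        HtmlNode p = doc.DocumentNode.SelectSingleNode("//p[@id='greeting']");
        Console.WriteLine(p.InnerText); // prints "Hello"

        // Passing a URL to LoadHtml just parses the URL text itself,
        // so there is no table (or any other element) to select.
        var urlDoc = new HtmlDocument();
        urlDoc.LoadHtml("http://localhost");
        HtmlNodeCollection tables = urlDoc.DocumentNode.SelectNodes("//table");
        Console.WriteLine(tables == null); // prints "True"
    }
}
```

Note that SelectNodes returns null rather than an empty collection when nothing matches, which is why the original code would fail with a NullReferenceException as soon as it tried to loop over the result.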
Here is working code, based on the HTML source you provided. It can be refactored, and I'm not checking for null values (in rows, cells, and each value inside the case blocks). If you have the page at 127.0.0.1, it will work. Just paste it inside the Main method of a Console Application and try to understand it.

    // Requires: using System; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlWeb().Load("http://127.0.0.1");
    // Every row of the table with class "data"
    var rows = doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
    foreach (var row in rows)
    {
        var cells = row.SelectNodes("./td");
        string title = cells[0].InnerText;  // first cell holds the field name
        var valueRow = cells[2];            // third cell holds the value
        switch (title)
        {
            // These two values are stored in the alt attribute of an image
            case "Part-Num":
                string partNum = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
                Console.WriteLine("Part-Num:\t" + partNum);
                break;
            case "Manu-Number":
                string manuNumber = valueRow.SelectSingleNode("./img[@alt]").Attributes["alt"].Value;
                Console.WriteLine("Manu-Num:\t" + manuNumber);
                break;
            // The remaining values are plain text inside the cell
            case "Description":
                string description = valueRow.InnerText;
                Console.WriteLine("Description:\t" + description);
                break;
            case "Manu-Country":
                string manuCountry = valueRow.InnerText;
                Console.WriteLine("Manu-Country:\t" + manuCountry);
                break;
            case "Last Modified":
                string lastModified = valueRow.InnerText;
                Console.WriteLine("Last Modified:\t" + lastModified);
                break;
            case "Last Modified By":
                string lastModifiedBy = valueRow.InnerText;
                Console.WriteLine("Last Modified By:\t" + lastModifiedBy);
                break;
        }
    }
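
Since the code above skips null checks, here is a rough sketch of what a guarded version might look like. It is not a drop-in replacement: the sample markup is hypothetical (guessed from the XPath queries, since the real page is not shown here), the class name NullSafeScrape is made up, and instead of the switch it simply prefers an image's alt text when one is present and falls back to the cell text otherwise.

```csharp
using System;
using HtmlAgilityPack;

class NullSafeScrape
{
    // Hypothetical sample markup; the real page may be laid out differently.
    const string SampleHtml = @"
<table class='data'>
  <tr><td>Part-Num</td><td>:</td><td><img src='p.png' alt='12345'></td></tr>
  <tr><td>Description</td><td>:</td><td>Widget</td></tr>
  <tr><td>Broken row</td></tr>
</table>";

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@class='data']/tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 3)
                continue; // skip rows that lack a title cell and a value cell

            string title = cells[0].InnerText.Trim();
            HtmlNode valueCell = cells[2];

            // Use the image's alt text when there is one, else the cell text.
            HtmlNode img = valueCell.SelectSingleNode("./img[@alt]");
            string value = img != null
                ? img.GetAttributeValue("alt", "")
                : valueCell.InnerText.Trim();

            Console.WriteLine(title + ":\t" + value);
        }
    }
}
```

With the sample markup, the third row is skipped because it has only one cell. To run it against the real page, swap the LoadHtml call for new HtmlWeb().Load("http://127.0.0.1") as in the answer above.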