I want to use the HTML agility pack to parse image and href links from a HTML page,but I just don't know much about XML or XPath.Though having looking up help documents in many web sites,I just can't solve the problem.In addition,I use C# in VisualStudio 2005.And I just can't speak English fluently,so,I will give my sincere thanks to the one can write some helpful codes.
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
{
string href = link["href"].Value;
// store href somewhere
}
So you can imagine that for img@src, just replace each a with img, and href with src.
You might even be able to simplify to:
foreach(HtmlNode node in doc.DocumentElement
.SelectNodes("//a/@href | //img/@src")
{
list.Add(node.Value);
}
For relative url handling, look at the Uri class.
The example and the accepted answer is wrong. It doesn't compile with the latest version. I try something else:
private List<string> ParseLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
r => r.Attributes.ToList().ConvertAll(
i => i.Value)).SelectMany(j => j).ToList();
}
This works for me.
Maybe I am too late here to post an answer. The following worked for me:
var MainImageString = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();
You also need to take into account the document base URL element (<base>) and protocol relative URLs (for example //www.foo.com/bar/).
For more information check:
- <base>: The Document Base URL element page on MDN
- The Protocol-relative URL article by Paul Irish
- What are the recommendations for html tag? discussion on StackOverflow
- Uri Constructor (Uri, Uri) page on MSDN
- Uri class doesn't handle the protocol-relative URL discussion no StackOverflow
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
.SelectNodes("//td/input")
.First()
.Attributes["value"].Value;
来源:https://stackoverflow.com/questions/4835868/how-to-get-img-src-or-a-hrefs-using-html-agility-pack