Question
I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites, but I still can't solve the problem. I use C# in Visual Studio 2005. I can't speak English fluently, so I would sincerely thank anyone who can write some helpful code.
Answer 1:
The first example on the home page does something very similar, but consider:
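HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
// Newer HtmlAgilityPack releases expose the root as doc.DocumentNode and
// read the attribute with link.Attributes["href"].Value.
foreach (HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
{
    string href = link["href"].Value;
    // store href somewhere
}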
For img/@src the same pattern applies: replace each a with img, and href with src.
You might even be able to simplify to:
// list is assumed to be a List<string> declared earlier
foreach (HtmlNode node in doc.DocumentElement
    .SelectNodes("//a/@href | //img/@src"))
{
    list.Add(node.Value);
}
For relative URL handling, look at the Uri class.
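As a rough illustration of the Uri suggestion, here is a minimal sketch that turns an extracted href or src value into an absolute URL. It uses only System.Uri; the class name RelativeUrlDemo, the helper ToAbsolute, and the example.com addresses are placeholders made up for this example, not part of the original answer.

```csharp
using System;

public static class RelativeUrlDemo
{
    // Resolves an href/src value against the URL of the page it was found on.
    public static string ToAbsolute(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl);        // the page URL must be absolute
        Uri resolved = new Uri(baseUri, link); // Uri(Uri, string) combines base and reference
        return resolved.AbsoluteUri;
    }

    public static void Main()
    {
        string page = "http://example.com/dir/page.html"; // placeholder page URL

        // Path relative to the page's directory
        Console.WriteLine(ToAbsolute(page, "images/a.png"));
        // prints http://example.com/dir/images/a.png

        // Path relative to the site root
        Console.WriteLine(ToAbsolute(page, "/top.html"));
        // prints http://example.com/top.html

        // Already absolute links pass through unchanged
        Console.WriteLine(ToAbsolute(page, "http://other.example.org/x"));
        // prints http://other.example.org/x
    }
}
```

The code avoids var and lambdas, so it should also be usable from older compilers such as the one in Visual Studio 2005.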
Answer 2:
The example and the accepted answer are wrong: they do not compile with the latest version. I tried something else:
private List<string> ParseLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null when no a element with an href is found
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    // Keep only the href value of each link (requires using System.Linq)
    return nodes == null
        ? new List<string>()
        : nodes.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
}
This works for me.
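Answer 3:
Maybe I am too late to post an answer, but the following worked for me (MainImageNode is assumed to be an img HtmlNode selected earlier):
// The result is an HtmlAttribute (or null); its Value property holds the URL
var MainImageString = MainImageNode.Attributes.Where(i => i.Name == "src").FirstOrDefault();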
Answer 4:
You also need to take into account the document base URL element (<base>) and protocol relative URLs (for example //www.foo.com/bar/).
For more information check:
- <base>: The Document Base URL element page on MDN
- The Protocol-relative URL article by Paul Irish
- What are the recommendations for html <base> tag? discussion on Stack Overflow
- Uri Constructor (Uri, Uri) page on MSDN
- Uri class doesn't handle the protocol-relative URL discussion on Stack Overflow
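To make the two points above concrete, here is a hedged sketch of one way to apply a <base> href and to handle protocol-relative links explicitly rather than relying on how Uri treats a leading "//". LinkResolver, Combine, Resolve, and all URLs except //www.foo.com/bar/ are hypothetical names chosen for illustration; the baseHref value is assumed to have been read from the page's <base> element beforehand (null when the page has none).

```csharp
using System;

public static class LinkResolver
{
    // Combines a base URL with a reference, treating "//host/path" as protocol-relative.
    public static Uri Combine(Uri baseUri, string reference)
    {
        if (reference.StartsWith("//", StringComparison.Ordinal))
        {
            // Protocol-relative: reuse the scheme of the base URL.
            return new Uri(baseUri.Scheme + ":" + reference);
        }
        return new Uri(baseUri, reference);
    }

    // pageUri:  the URL the document was downloaded from.
    // baseHref: the href of the <base> element, or null if the page has none.
    public static Uri Resolve(Uri pageUri, string baseHref, string link)
    {
        Uri effectiveBase = string.IsNullOrEmpty(baseHref)
            ? pageUri
            : Combine(pageUri, baseHref);
        return Combine(effectiveBase, link);
    }

    public static void Main()
    {
        Uri page = new Uri("https://example.com/docs/page.html"); // placeholder page URL

        // No <base>: the link is resolved against the page URL.
        Console.WriteLine(Resolve(page, null, "img/logo.png").AbsoluteUri);
        // prints https://example.com/docs/img/logo.png

        // With <base href="/assets/">: the link is resolved against the base instead.
        Console.WriteLine(Resolve(page, "/assets/", "img/logo.png").AbsoluteUri);
        // prints https://example.com/assets/img/logo.png

        // Protocol-relative link: inherits https from the page.
        Console.WriteLine(Resolve(page, null, "//www.foo.com/bar/").AbsoluteUri);
        // prints https://www.foo.com/bar/
    }
}
```

Prefixing the scheme by hand is a deliberate choice here: it keeps the result independent of how a given framework version parses a reference that starts with two slashes.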
Answer 5:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;
Source: https://html-agility-pack.net/select-nodes
Source: https://stackoverflow.com/questions/4835868/how-to-get-img-src-or-a-hrefs-using-html-agility-pack