Currently I use .NET WebBrowser.Document.Images() to do this. It requires the WebBrowser to load the document. It's messy and takes up resources.
The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there - how much of it is really well formed? I needed to do something similar - parse out all links in a document (and in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).
Here's a snippet for iterating over links in a document:
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");

// Select every <a> element that carries an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null when nothing matches, so guard against that.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here, e.g. rewrite attrib.Value
    }
}
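Since your original goal is pulling image URLs rather than links, the same pattern should work with an //img[@src] XPath. Here's a rough sketch (untested; the URL is just a placeholder) that loads the page over HTTP with HtmlWeb instead of a local file, so you don't need the WebBrowser control at all:

using System;
using HtmlAgilityPack;

class ImageScraper
{
    static void Main()
    {
        // Fetch and parse the page directly -- no WebBrowser control needed.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/"); // placeholder URL

        // Select every <img> element that carries a src attribute.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");

        if (imgNodes != null)
        {
            foreach (HtmlNode imgNode in imgNodes)
            {
                Console.WriteLine(imgNode.Attributes["src"].Value);
            }
        }
    }
}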
Original Blog Post