I\'m mirroring some internal websites for backup purposes. As of right now I basically use this c# code:
System.Net.WebClient client = new System.Net.WebCli
The most robust solution would be to use the HTMLAgilityPack as others have suggested. However a reasonable solution using regular expressions is possible using the Replace overload that takes a MatchEvaluator delegate, as follows:
var baseUri = new Uri("http://test.com");
var pattern = @"(?src|href)=""(?/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
match =>
{
var value = match.Groups["value"].Value;
Uri uri;
if (Uri.TryCreate(baseUri, value, out uri))
{
var name = match.Groups["name"].Value;
return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
}
return null;
});
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);
The above sample searches for attributes named src and href that contain double quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to determine if the value is a valid relative uri.
Note that this solution doesn't handle single quoted attribute values and certainly doesn't work on poorly formed HTML with unquoted values.