I'm mirroring some internal websites for backup purposes. As of right now I basically use this C# code:
System.Net.WebClient client = new System.Net.WebCli
I know this is an older question, but I figured out how to do it with a fairly simple regex, and it works well for me. It leaves absolute http/https URLs untouched and rewrites both root-relative and current-directory-relative URLs.
var host = "http://www.google.com/";
var baseUrl = host + "images/";
var html = "
";
var regex = "(?<=(?:href|src)=\")(?!https?://)(?<url>[^\"]+)";
html = Regex.Replace(
html,
regex,
match => match.Groups["url"].Value.StartsWith("/")
? host + match.Groups["url"].Value.Substring(1)
: baseUrl + match.Groups["url"].Value);
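For anyone who wants to try this out, here is a minimal self-contained sketch of the same approach wrapped in a console program. The `UrlRewriter` class, the `MakeAbsolute` method, and the sample markup are my own placeholders for illustration, not part of the original snippet; the pattern relies on the named group `url` that the replacement callback reads.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRewriter
{
    // Matches the value of a double-quoted href or src attribute that does
    // not already start with http:// or https://, and captures it as "url".
    private static readonly Regex RelativeUrl = new Regex(
        "(?<=(?:href|src)=\")(?!https?://)(?<url>[^\"]+)");

    // Hypothetical helper: host and baseUrl are both expected to end in "/".
    public static string MakeAbsolute(string html, string host, string baseUrl)
    {
        return RelativeUrl.Replace(html, match =>
        {
            var url = match.Groups["url"].Value;
            return url.StartsWith("/", StringComparison.Ordinal)
                ? host + url.Substring(1)   // root-relative: drop the leading slash
                : baseUrl + url;            // relative to the current directory
        });
    }
}

public static class Program
{
    public static void Main()
    {
        var host = "http://www.google.com/";
        var baseUrl = host + "images/";

        // Made-up sample markup, only here to exercise the three cases.
        var html = "<a href=\"/about\">About</a>"
                 + "<img src=\"logo.png\">"
                 + "<a href=\"https://example.com/\">External</a>";

        Console.WriteLine(UrlRewriter.MakeAbsolute(html, host, baseUrl));
    }
}
```

With that sample input, the first link becomes `http://www.google.com/about`, the image becomes `http://www.google.com/images/logo.png`, and the `https://` link is left alone. Keep in mind that this only covers double-quoted attributes; protocol-relative URLs (`//host/path`), `../` segments, `mailto:` links, and `#fragment` links would need extra handling.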