I'm mirroring some internal websites for backup purposes. As of right now I basically use this C# code:
System.Net.WebClient client = new System.Net.WebCli
You could use the HTMLAgilityPack to accomplish this. You would do something along these (not tested) lines:
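A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`HapLinkRewriter`, `MakeLinksAbsolute`) are hypothetical, and only `src` and `href` attributes are considered:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class HapLinkRewriter
{
    // Hypothetical helper: rewrites every src/href value to an absolute URI.
    public static string MakeLinksAbsolute(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var attributeName in new[] { "src", "href" })
        {
            // SelectNodes returns null (not an empty list) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//*[@" + attributeName + "]");
            if (nodes == null) continue;

            foreach (var node in nodes)
            {
                var value = node.GetAttributeValue(attributeName, string.Empty);
                Uri absolute;
                // Values that cannot be resolved are left as they are.
                if (value.Length > 0 && Uri.TryCreate(pageUri, value, out absolute))
                    node.SetAttributeValue(attributeName, absolute.AbsoluteUri);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

Pass the URL the page was downloaded from as `pageUri`, so that folder-relative links such as `page.html` resolve against the right directory.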
Here are a few examples:
Relative to absolute paths in HTML (asp.net)
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
Just use this function:
' Converts a relative URL to an absolute URI
Function RelativeToAbsoluteUrl(ByVal baseURI As Uri, ByVal RelativeUrl As String) As Uri
    ' Parse the URL, which may be relative or absolute
    Dim uriReturn As Uri = New Uri(RelativeUrl, UriKind.RelativeOrAbsolute)
    ' Make it absolute if it's relative
    If Not uriReturn.IsAbsoluteUri Then
        uriReturn = New Uri(baseURI, uriReturn)
    End If
    Return uriReturn
End Function
The most robust solution would be to use the HTMLAgilityPack, as others have suggested. However, a reasonable solution using regular expressions is possible with the Regex.Replace overload that takes a MatchEvaluator delegate, as follows:
var baseUri = new Uri("http://test.com");
// Matches src="..." or href="..." where the value starts with a forward slash
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
    match =>
    {
        var value = match.Groups["value"].Value;
        Uri uri;
        if (Uri.TryCreate(baseUri, value, out uri))
        {
            var name = match.Groups["name"].Value;
            return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
        }
        // Keep the original attribute text; returning null would delete the match
        return match.Value;
    });
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);
The above sample searches for attributes named src and href that contain double-quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to check whether the value can be resolved against the base URI, and if so the attribute is rewritten with the absolute URI.
Note that this solution doesn't handle single-quoted attribute values and certainly doesn't work on poorly formed HTML with unquoted values.
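One way to relax the double-quote restriction is to capture the opening quote and require the same character to close the value. A minimal sketch under that assumption follows; `LinkRewriter` and `MakeRootRelativeLinksAbsolute` are hypothetical names, and unquoted or malformed attributes are still not handled:

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriter
{
    // Hypothetical helper: rewrites root-relative src/href values
    // (those starting with "/") in single- or double-quoted attributes.
    public static string MakeRootRelativeLinksAbsolute(string html, Uri baseUri)
    {
        // (?<![\w-]) avoids matching inside names such as data-href.
        // \k<quote> requires the closing quote to be the same as the opening one.
        const string pattern =
            @"(?<![\w-])(?<name>src|href)=(?<quote>[""'])(?<value>/.*?)\k<quote>";

        return Regex.Replace(html, pattern, match =>
        {
            Uri uri;
            if (!Uri.TryCreate(baseUri, match.Groups["value"].Value, out uri))
                return match.Value; // leave the attribute unchanged

            var name = match.Groups["name"].Value;
            var quote = match.Groups["quote"].Value;
            return name + "=" + quote + uri.AbsoluteUri + quote;
        }, RegexOptions.IgnoreCase);
    }
}
```

The original quote character is written back, so the rest of the markup stays as it was.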
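Uri WebsiteImAt = new Uri(
    "http://www.w3schools.com/media/media_mimeref.asp?q=1&s=2,2#a");
// Root-relative path: http://www.w3schools.com/something/somethingelse/filename.asp
string href = new Uri(WebsiteImAt, "/something/somethingelse/filename.asp")
    .AbsoluteUri;
// Relative to the current folder: http://www.w3schools.com/media/something.asp
string href2 = new Uri(WebsiteImAt, "something.asp").AbsoluteUri;
// http://www.w3schools.com/media/something
string href3 = new Uri(WebsiteImAt, "something").AbsoluteUri;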
which, with your Regex-based approach, is probably (untested) mappable to:
String value = Regex.Replace(text, "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", match =>
    "<" + match.Groups[1].Value + match.Groups[2].Value + "=\""
    + new Uri(WebsiteImAt, match.Groups[3].Value).AbsoluteUri + "\""
    + match.Groups[4].Value + ">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
I should also advise not to use Regex here, but to apply the Uri trick to some code using a DOM, perhaps XmlDocument (if XHTML) or the HTML Agility Pack (otherwise), looking at all //@src or //@href attributes.
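For the XHTML case, the DOM approach might look roughly like the sketch below. It assumes the input is well-formed XML that needs no external DTD or undeclared entities (such as `&nbsp;`); `XhtmlLinkRewriter` and `MakeLinksAbsolute` are hypothetical names:

```csharp
using System;
using System.Linq;
using System.Xml;

static class XhtmlLinkRewriter
{
    // Hypothetical helper: resolves every src/href attribute against pageUri.
    public static string MakeLinksAbsolute(string xhtml, Uri pageUri)
    {
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.XmlResolver = null; // do not try to download an external DTD
        doc.LoadXml(xhtml);

        // Copy the matches to a list first so the document is not modified
        // while the XPath result is still being enumerated.
        var attributes = doc.SelectNodes("//@src | //@href")
                            .Cast<XmlAttribute>()
                            .ToList();

        foreach (var attribute in attributes)
        {
            Uri absolute;
            // Values that cannot be resolved are left as they are.
            if (Uri.TryCreate(pageUri, attribute.Value, out absolute))
                attribute.Value = absolute.AbsoluteUri;
        }
        return doc.OuterXml;
    }
}
```

Values that are already absolute pass through unchanged, because resolving an absolute URI against a base yields the same URI.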