C# Convert Relative to Absolute Links in HTML String

余生分开走 2020-12-16 04:03

I'm mirroring some internal websites for backup purposes. Right now I basically use this C# code:

System.Net.WebClient client = new System.Net.WebCli

10 Answers
  •  余生分开走
    2020-12-16 04:56

    The most robust solution would be to use the HtmlAgilityPack, as others have suggested. However, a reasonable regular-expression solution is possible with the Regex.Replace overload that takes a MatchEvaluator delegate, as follows:

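    // Requires: using System; using System.Text.RegularExpressions;
    var baseUri = new Uri("http://test.com");
    // Named groups: "name" is the attribute, "value" is a root-relative path.
    var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
    var matchEvaluator = new MatchEvaluator(
        match =>
        {
            var value = match.Groups["value"].Value;
            Uri uri;

            // Combine the relative value with the base URI.
            if (Uri.TryCreate(baseUri, value, out uri))
            {
                var name = match.Groups["name"].Value;
                return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
            }

            // Returning null would delete the attribute; keep the original text instead.
            return match.Value;
        });
    var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);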
    

    The sample above searches for attributes named src and href whose double-quoted values start with a forward slash. For each match, the static Uri.TryCreate method attempts to combine the value with the base URI, and succeeds only if the value is a valid relative URI.
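
    To make the resolution step concrete, here is a minimal sketch of how Uri.TryCreate combines a base URI with a relative value. The class and method names (UriResolutionDemo, Resolve) are made up for illustration and are not part of the answer's code:

    ```csharp
    using System;

    public static class UriResolutionDemo
    {
        // Hypothetical helper: returns the absolute form of `relative`
        // resolved against `baseAddress`, or null if it cannot be combined.
        public static string Resolve(string baseAddress, string relative)
        {
            var baseUri = new Uri(baseAddress);
            Uri result;
            return Uri.TryCreate(baseUri, relative, out result)
                ? result.AbsoluteUri
                : null;
        }
    }
    ```

    For example, Resolve("http://test.com/docs/page.html", "/img/logo.png") should yield "http://test.com/img/logo.png", because a value starting with a forward slash is resolved from the root of the host.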

    Note that this solution doesn't handle single-quoted attribute values, and it certainly doesn't work on poorly formed HTML with unquoted values.
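
    One possible way to lift the single-quote limitation is to capture the opening quote and require the same character to close the value. The sketch below is an assumption-laden variant, not the answer's original code: the LinkRewriter class, the MakeAbsolute method, and the "quote" group are names introduced here for illustration. It still does nothing for unquoted values or malformed markup.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class LinkRewriter
    {
        // Matches src/href whose quoted value starts with "/".
        // \k<quote> forces the closing quote to match the opening one.
        private static readonly Regex AttributePattern = new Regex(
            @"(?<name>src|href)=(?<quote>[""'])(?<value>/.*?)\k<quote>",
            RegexOptions.IgnoreCase);

        // Hypothetical helper: rewrites root-relative src/href values in
        // `html` to absolute URLs based on `baseUri`.
        public static string MakeAbsolute(string html, Uri baseUri)
        {
            return AttributePattern.Replace(html, match =>
            {
                Uri uri;
                if (!Uri.TryCreate(baseUri, match.Groups["value"].Value, out uri))
                {
                    // Leave the attribute untouched if it cannot be resolved.
                    return match.Value;
                }

                // Re-emit the attribute with the quote style it came with.
                var quote = match.Groups["quote"].Value;
                return match.Groups["name"].Value + "=" + quote + uri.AbsoluteUri + quote;
            });
        }
    }
    ```

    A call such as LinkRewriter.MakeAbsolute(originalHtml, new Uri("http://test.com")) would then cover both quote styles. Because the quote character is preserved, the rest of the markup should come through unchanged.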
