C# Convert Relative to Absolute Links in HTML String

余生分开走 2020-12-16 04:03

I'm mirroring some internal websites for backup purposes. As of right now I basically use this C# code:

System.Net.WebClient client = new System.Net.WebCli         


        
10 answers
  • 2020-12-16 04:49

    You could use the HTMLAgilityPack to accomplish this. You would do something along these (untested) lines:

    • Load the url
    • Select all links
    • Load each link into a Uri and test whether it is relative; if it is, convert it to absolute
    • Update the link's value with the new Uri
    • Save the file

    Here are a few examples:

    Relative to absolute paths in HTML (asp.net)

    http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

    http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
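The steps listed in this answer might look roughly like the following with the Html Agility Pack. This is a minimal sketch, not a tested implementation: it assumes the third-party `HtmlAgilityPack` NuGet package is referenced, and the class name `LinkAbsolutizer` and method name `MakeLinksAbsolute` are hypothetical names chosen for illustration. Loading the page and saving the result are left to the caller.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack" (assumed to be installed)

// Hypothetical helper, named for illustration only.
public static class LinkAbsolutizer
{
    // Rewrites every href/src attribute in the HTML so that relative
    // values are resolved against pageUri.
    public static string MakeLinksAbsolute(string html, Uri pageUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@href or @src]");
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            foreach (var name in new[] { "href", "src" })
            {
                string value = node.GetAttributeValue(name, "");
                if (value.Length == 0) continue; // attribute missing or empty

                // Relative values are resolved against pageUri;
                // already-absolute values are not re-based.
                if (Uri.TryCreate(pageUri, value, out Uri absolute))
                    node.SetAttributeValue(name, absolute.AbsoluteUri);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

It could be called as `LinkAbsolutizer.MakeLinksAbsolute(html, new Uri("http://test.com/dir/page.html"))`, where the second argument is the address the page was downloaded from.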

  • 2020-12-16 04:52

    Just use this function:

    ' Converts a relative URL to an absolute URI
    Function RelativeToAbsoluteUrl(ByVal baseURI As Uri, ByVal RelativeUrl As String) As Uri
        ' Parse the link, which may be relative or absolute
        Dim uriReturn As Uri = New Uri(RelativeUrl, UriKind.RelativeOrAbsolute)
        ' Make it absolute if it's relative
        If Not uriReturn.IsAbsoluteUri Then
            uriReturn = New Uri(baseURI, uriReturn)
        End If
        Return uriReturn
    End Function
    
  • 2020-12-16 04:56

    The most robust solution would be to use the HTMLAgilityPack, as others have suggested. However, a reasonable solution using regular expressions is possible with the Regex.Replace overload that takes a MatchEvaluator delegate, as follows:

    // Requires: using System; using System.Text.RegularExpressions;
    var baseUri = new Uri("http://test.com");
    // Double-quoted src/href values that start with a forward slash
    var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
    var matchEvaluator = new MatchEvaluator(
        match =>
        {
            var value = match.Groups["value"].Value;
            Uri uri;

            if (Uri.TryCreate(baseUri, value, out uri))
            {
                var name = match.Groups["name"].Value;
                return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
            }

            // Keep the original attribute; returning null would delete it
            return match.Value;
        });
    var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);
    

    The above sample searches for attributes named src and href that contain double-quoted values starting with a forward slash. For each match, the static Uri.TryCreate method tries to resolve the value against the base URI, and the attribute is rewritten only if that succeeds.

    Note that this solution doesn't handle single quoted attribute values and certainly doesn't work on poorly formed HTML with unquoted values.
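The single-quote limitation noted above could be lifted with a variant along these lines. This is only a sketch under stated assumptions: the class name `RegexLinkRewriter` and method name `MakeLinksAbsolute` are hypothetical, the pattern accepts any quoted value rather than only those starting with a slash, and unquoted values are still not handled. It needs C# 7 or later for the inline `out` variable.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class RegexLinkRewriter
{
    // Matches src/href attributes whose value is wrapped in either quote style.
    // The backreference \k<q> requires the closing quote to match the opening one,
    // and the lookbehind skips names such as data-src.
    private static readonly Regex LinkAttribute = new Regex(
        @"(?<![\w-])(?<name>src|href)\s*=\s*(?<q>[""'])(?<value>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string MakeLinksAbsolute(string html, Uri baseUri)
    {
        return LinkAttribute.Replace(html, match =>
        {
            string value = match.Groups["value"].Value;
            if (value.Length > 0 && Uri.TryCreate(baseUri, value, out Uri uri))
            {
                // Re-emit the attribute with its original quote character.
                string q = match.Groups["q"].Value;
                return match.Groups["name"].Value + "=" + q + uri.AbsoluteUri + q;
            }
            return match.Value; // leave empty or unresolvable values untouched
        });
    }
}
```

Because the pattern is applied to raw text, it will also rewrite matching strings inside scripts or comments, which is one more reason to prefer a real parser where possible.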

  • 2020-12-16 04:56
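    Uri WebsiteImAt = new Uri(
           "http://www.w3schools.com/media/media_mimeref.asp?q=1&s=2,2#a");
    // Root-relative: http://www.w3schools.com/something/somethingelse/filename.asp
    string href = new Uri(WebsiteImAt, "/something/somethingelse/filename.asp")
           .AbsoluteUri;
    // Document-relative: http://www.w3schools.com/media/something.asp
    string href2 = new Uri(WebsiteImAt, "something.asp").AbsoluteUri;
    // Document-relative: http://www.w3schools.com/media/something
    string href3 = new Uri(WebsiteImAt, "something").AbsoluteUri;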
    

    With your Regex-based approach, this probably maps to something like the following (untested):

    String value = Regex.Replace(text, "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", match =>
        "<" + match.Groups[1].Value + match.Groups[2].Value + "=\""
            + new Uri(WebsiteImAt, match.Groups[3].Value).AbsoluteUri + "\""
            + match.Groups[4].Value + ">", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    

    I would also advise against using Regex here. Instead, apply the Uri trick in code that works on a DOM, perhaps XmlDocument (if the page is XHTML) or the HTML Agility Pack (otherwise), visiting all //@src and //@href attributes.
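For the XHTML case mentioned here, a sketch with XmlDocument could look like this. It assumes the input is well-formed XML with no HTML-only entities such as `&nbsp;`; the class name `XhtmlLinkRewriter` and method name `MakeLinksAbsolute` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Linq;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class XhtmlLinkRewriter
{
    public static string MakeLinksAbsolute(string xhtml, Uri baseUri)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null;        // do not fetch external DTDs
        doc.PreserveWhitespace = true; // keep the original layout
        doc.LoadXml(xhtml);            // throws XmlException if not well-formed

        // Unprefixed attributes are in no namespace, so this XPath also works
        // when the elements are in the XHTML namespace. The result is copied
        // to a list before any attribute is modified.
        var attributes = doc.SelectNodes("//@src | //@href").Cast<XmlNode>().ToList();

        foreach (XmlNode attr in attributes)
        {
            if (attr.Value.Length > 0 && Uri.TryCreate(baseUri, attr.Value, out Uri absolute))
                attr.Value = absolute.AbsoluteUri;
        }
        return doc.OuterXml;
    }
}
```

Real-world pages are often not valid XML, so this route only fits sites known to serve strict XHTML; for anything else the HTML Agility Pack is the more forgiving choice.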
