C# Convert Relative to Absolute Links in HTML String

余生分开走 2020-12-16 04:03

I'm mirroring some internal websites for backup purposes. Right now I basically use this C# code:

System.Net.WebClient client = new System.Net.WebCli         


        
10 answers
  • 2020-12-16 04:35

    While this may not be the most robust of solutions, it should get the job done.

    var host = "http://domain.is";
    var someHtml = @"
    <a href=""/some/relative"">Relative</a>
    <img src=""/some/relative"" />
    <a href=""http://domain.is/some/absolute"">Absolute</a>
    <img src=""http://domain.is/some/absolute"" />
    ";
    
    
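    // Strip the host from links that are already absolute so it is not added twice.
    someHtml = someHtml.Replace("src=\"" + host, "src=\"");
    someHtml = someHtml.Replace("href=\"" + host, "href=\"");
    // Prefix every link with the host. This assumes all relative links are root-relative.
    someHtml = someHtml.Replace("src=\"", "src=\"" + host);
    someHtml = someHtml.Replace("href=\"", "href=\"" + host);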
    
  • 2020-12-16 04:36

    This is what you are looking for. This VB.NET snippet converts every relative URL in an HTML string to an absolute one:

    Private Function ConvertALLrelativeLinksToAbsoluteUri(ByVal html As String, ByVal PageURL As String) As String
        ' Rewrite every href value as an absolute URI
        Dim XpHref As New Regex("(href="".*?"")", RegexOptions.IgnoreCase)
        Dim i As Integer
        Dim NewSTR As String = html
        For i = 0 To XpHref.Matches(html).Count - 1
            Application.DoEvents()
            Dim Oldurl As String = Nothing
            Dim OldHREF As String = Nothing
            Dim MainURL As New Uri(PageURL)
            OldHREF = XpHref.Matches(html).Item(i).Value
            Oldurl = OldHREF.Replace("href=", "").Replace("HREF=", "").Replace("""", "")
            Dim NEWURL As New Uri(MainURL, Oldurl)
            Dim NewHREF As String = "href=""" & NEWURL.AbsoluteUri & """"
            NewSTR = NewSTR.Replace(OldHREF, NewHREF)
        Next
        html = NewSTR
        Dim XpSRC As New Regex("(src="".*?"")", RegexOptions.IgnoreCase)
        For i = 0 To XpSRC.Matches(html).Count - 1
            Application.DoEvents()
            Dim Oldurl As String = Nothing
            Dim OldHREF As String = Nothing
            Dim MainURL As New Uri(PageURL)
            OldHREF = XpSRC.Matches(html).Item(i).Value
            Oldurl = OldHREF.Replace("src=", "").Replace("SRC=", "").Replace("""", "")
            Dim NEWURL As New Uri(MainURL, Oldurl)
            Dim NewHREF As String = "src=""" & NEWURL.AbsoluteUri & """"
            NewSTR = NewSTR.Replace(OldHREF, NewHREF)
        Next
        Return NewSTR
    End Function
    
  • 2020-12-16 04:39

    You should use HtmlAgilityPack to load the HTML, read every href with it, and then use the Uri class to convert relative URLs to absolute ones as necessary.

    See for example http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
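
    A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The class name HapLinkRewriter, the method MakeAbsolute, and the pageUrl parameter are made up for illustration and do not come from the linked post. Only href and src attributes are handled here.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

static class HapLinkRewriter
{
    // Hypothetical helper: rewrites every href/src value in the HTML
    // so that it is absolute relative to pageUrl.
    public static string MakeAbsolute(string html, string pageUrl)
    {
        var baseUri = new Uri(pageUrl);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (string attribute in new[] { "href", "src" })
        {
            // SelectNodes returns null rather than an empty collection when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//*[@" + attribute + "]");
            if (nodes == null) continue;

            foreach (HtmlNode node in nodes)
            {
                string value = node.GetAttributeValue(attribute, "");
                // Values that cannot be resolved are left as they are.
                if (Uri.TryCreate(baseUri, value, out var resolved))
                    node.SetAttributeValue(attribute, resolved.AbsoluteUri);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

    Calling HapLinkRewriter.MakeAbsolute(someHtml, "http://domain.is/") on the downloaded page would then give markup that can be saved as-is. Links that are already absolute pass through unchanged, because resolving an absolute URL against a base returns that same URL.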

  • 2020-12-16 04:41

    I think url is of type string. Use Uri instead with a base uri pointing to your domain:

    Uri baseUri = new Uri("http://domain.is");
    Uri myUri = new Uri(baseUri, url);
    
    System.Net.WebClient client = new System.Net.WebClient();
    byte[] dl = client.DownloadData(myUri);
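
    To show what the two-argument Uri constructor does with different kinds of links, here is a small sketch using only System.Uri. The class UriResolutionDemo, the method Resolve, and the page URL in the comments are invented for the example.

```csharp
using System;

static class UriResolutionDemo
{
    // Resolves a link found on a page against the URL of that page.
    public static string Resolve(string pageUrl, string reference)
    {
        var baseUri = new Uri(pageUrl);
        // If reference is already absolute, the base is ignored.
        return new Uri(baseUri, reference).AbsoluteUri;
    }
}

// With pageUrl = "http://domain.is/docs/page.html":
//   "/some/relative"          resolves to http://domain.is/some/relative
//   "img/logo.png"            resolves to http://domain.is/docs/img/logo.png
//   "../up.css"               resolves to http://domain.is/up.css
//   "http://other.example/x"  stays        http://other.example/x
```

    Passing the URL of the page itself as the base, rather than only the host, is what makes directory-relative links such as img/logo.png resolve correctly.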
    
  • 2020-12-16 04:43

    A simple function for ASP.NET Web Forms, where Request and Page are available:

    public string ConvertRelativeUrlToAbsoluteUrl(string relativeUrl)
    {
        // Use the scheme of the current request so that https pages produce https links.
        string scheme = Request.IsSecureConnection ? "https" : "http";
        return string.Format("{0}://{1}{2}", scheme, Request.Url.Host, Page.ResolveUrl(relativeUrl));
    }
    
  • 2020-12-16 04:46

    I know this is an older question, but I figured out how to do it with a fairly simple regex. It works well for me. It handles http/https and also root-relative and current directory-relative.

    var host = "http://www.google.com/";
    var baseUrl = host + "images/";
    var html = "<html><head></head><body><img src=\"/images/srpr/logo3w.png\" /><br /><img src=\"srpr/logo3w.png\" /></body></html>";
    var regex = "(?<=(?:href|src)=\")(?!https?://)(?<url>[^\"]+)";
    html = Regex.Replace(
        html,
        regex,
        match => match.Groups["url"].Value.StartsWith("/")
            ? host + match.Groups["url"].Value.Substring(1)
            : baseUrl + match.Groups["url"].Value);
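
    The same idea can be combined with the Uri class so that the resolution rules do not have to be written by hand. This is a sketch rather than a drop-in replacement: RegexLinkRewriter and MakeAbsolute are made-up names, and like the regex above it only looks at double-quoted href and src values.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexLinkRewriter
{
    // "prefix" captures href=" or src=" and "url" captures the attribute value.
    // The lookbehind skips attributes such as data-src.
    private static readonly Regex UrlAttribute = new Regex(
        @"(?<prefix>(?<![\w-])(?:href|src)\s*=\s*"")(?<url>[^""]+)",
        RegexOptions.IgnoreCase);

    public static string MakeAbsolute(string html, string pageUrl)
    {
        var baseUri = new Uri(pageUrl);
        return UrlAttribute.Replace(html, match =>
        {
            string url = match.Groups["url"].Value;
            // Leave the attribute untouched if the value cannot be resolved.
            return Uri.TryCreate(baseUri, url, out var resolved)
                ? match.Groups["prefix"].Value + resolved.AbsoluteUri
                : match.Value;
        });
    }
}
```

    For the html string above, RegexLinkRewriter.MakeAbsolute(html, "http://www.google.com/images/") should turn both img sources into http://www.google.com/images/srpr/logo3w.png. Because Uri does the resolving, links with ../ segments and links that are already absolute need no special cases in the pattern.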
    