REGEX to change all relative URLs to absolute

灰色年华 2020-12-02 14:54

I'm currently creating a Node.js web scraper/proxy, but I'm having trouble parsing relative URLs found in the scripting part of the source. I figured regex would do the tri…

5 answers
  •  北荒 (original poster)
    2020-12-02 15:37

    Advanced HTML string replacement functions

    Note for the OP, who requested such a function: change base_url to your proxy's base URL to achieve the desired results.

    Two functions are shown below (the usage guide is contained within the code). Read the whole explanation in this answer to fully understand how the functions behave.

    • rel_to_abs(url) - This function returns absolute URLs. When an absolute URL with a commonly trusted protocol is passed, it returns that URL immediately. Otherwise, an absolute URL is generated from base_url and the function argument. Relative URLs are correctly parsed (../ ; ./ ; . ; //).
    • replace_all_rel_by_abs - This function parses all occurrences of URLs that have a significant meaning in HTML, such as CSS url(), links and external resources. See the code for a full list of parsed instances. See this answer for an adjusted implementation that sanitises HTML strings from an external source (to embed in the document).
    • Test case (at the bottom of the answer): To test the effectiveness of the function, simply paste the bookmarklet into the browser's location bar.


    rel_to_abs - Parsing relative URLs

    function rel_to_abs(url){
        /* Only accept commonly trusted protocols.
         * Of the data: URLs, only data:image is accepted. Exotic flavours
         * (escaped slash, html-entitied characters) are not supported, to keep
         * the function fast. The data:image alternative ends at the ";" because
         * such URLs have no second ":" directly after it. */
        if(/^(?:(?:https?|file|ftps?|mailto|javascript):|data:image\/[^;]{2,9};)/i.test(url))
            return url; //Url is already absolute
    
        var base_url = location.href.match(/^(.+)\/?(?:#.+)?$/)[0]+"/";
        if(url.substring(0,2) == "//")
            return location.protocol + url;
        else if(url.charAt(0) == "/")
            return location.protocol + "//" + location.host + url;
        else if(url.substring(0,2) == "./")
            url = "." + url;
        else if(/^\s*$/.test(url))
            return ""; //Empty = Return nothing
        else url = "../" + url;
    
        url = base_url + url;
        /* Repeatedly strip "segment/../" pairs until no "/../" is left */
        while(/\/\.\.\//.test(url = url.replace(/[^\/]+\/+\.\.\//g,"")));
    
        /* Escape certain characters to prevent XSS */
        url = url.replace(/\.$/,"").replace(/\/\./g,"").replace(/"/g,"%22")
                .replace(/'/g,"%27").replace(/</g,"%3C").replace(/>/g,"%3E");
        return url;
    }
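
The escaping step at the end of rel_to_abs can be looked at in isolation. The following is a minimal sketch, assuming the intent is to percent-encode quotes and angle brackets so that a URL cannot break out of an HTML attribute; the name escape_url_chars is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: percent-encode characters that could terminate an
// HTML attribute value or open a new tag.
function escape_url_chars(url) {
    return url.replace(/"/g, "%22")
              .replace(/'/g, "%27")
              .replace(/</g, "%3C")
              .replace(/>/g, "%3E");
}

// A URL carrying an injection attempt is neutralised:
console.log(escape_url_chars('http://x/"><script>'));
// http://x/%22%3E%3Cscript%3E
```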
    

    Cases / examples:

    • http://foo.bar. Already an absolute URL, thus returned immediately.
    • /doo Relative to the root: Returns the current root + provided relative URL.
    • ./meh Relative to the current directory.
    • ../booh Relative to the parent directory.
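
Because the question is about Node.js, where there is no location object, the cases above can be cross-checked with the built-in WHATWG URL class, which resolves a relative reference against an explicit base. This is a sketch of an alternative, not the regex-based approach of this answer; the base URL and the name to_absolute are assumptions made for illustration.

```javascript
// Hypothetical base: the address of the page the scraped HTML came from.
var base = "http://domain/sub/page.html";

// Resolve `url` against `base` using Node's global URL class.
// Throws a TypeError if the result is not a valid URL.
function to_absolute(url, base) {
    return new URL(url, base).href;
}

console.log(to_absolute("http://foo.bar", base)); // http://foo.bar/
console.log(to_absolute("/doo", base));           // http://domain/doo
console.log(to_absolute("./meh", base));          // http://domain/sub/meh
console.log(to_absolute("../booh", base));        // http://domain/booh
```

Unlike rel_to_abs, this does not escape quotes or angle brackets in the same way, so the result still needs care before it is written back into HTML.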

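    The function prefixes relative paths with ../ (so that the file name at the end of the base URL is dropped), and then repeatedly performs a search-and-replace that removes each "segment/../" pair (http://domain/sub/anything-but-a-slash/../me becomes http://domain/sub/me).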
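
The search-and-replace step can be tried on its own. The sketch below reuses the same regular expression as rel_to_abs, but loops until the string stops changing rather than testing for "/../"; the name collapse_parent_segments is hypothetical.

```javascript
// Hypothetical helper: remove every "segment/../" pair from a URL.
// Each pass removes the innermost pairs; the loop ends once a pass
// leaves the string unchanged.
function collapse_parent_segments(url) {
    var previous;
    do {
        previous = url;
        url = url.replace(/[^\/]+\/+\.\.\//g, "");
    } while (url !== previous);
    return url;
}

console.log(collapse_parent_segments("http://domain/sub/anything-but-a-slash/../me"));
// http://domain/sub/me
console.log(collapse_parent_segments("http://domain/a/b/../../c"));
// http://domain/c
```

Note that the pattern is purely textual: with more ../ segments than directories, it will start consuming the host name, so the input is assumed to be well-formed.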


    replace_all_rel_by_abs - Convert all relevant occurrences of URLs
    URLs inside script instances (