PHP Regex to determine relative or absolute path

难免孤独 2021-01-23 17:31

I'm using cURL to pull the contents of a remote site. I need to check all "href=" attributes and determine if they're relative or absolute paths, then get the value of the li

2 Answers
  •  青春惊慌失措
    2021-01-23 18:08

    A combination of a regex and PHP's parse_url() should help:

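    // find all links in a page used within href="" or href='' syntax
    $links = array();
    // PREG_SET_ORDER yields one array per match: [full match, double-quoted value, single-quoted value]
    preg_match_all('/href=(?:"([^"]+)"|\'([^\']+)\')/i', $page_contents, $links, PREG_SET_ORDER);
    
    // iterate through each match and check if it's "absolute"
    $urls = array();
    foreach ($links as $match) {
        // group 2 is only filled when the value was single-quoted; otherwise use group 1
        $link = (isset($match[2]) && $match[2] !== '') ? $match[2] : $match[1];
        $path = $link;
        if ((substr($link, 0, 7) == 'http://') || (substr($link, 0, 8) == 'https://')) {
            // the current link is an "absolute" URL - parse it to get just the path
            $parsed = parse_url($link);
            // parse_url() omits 'path' for URLs such as http://example.com
            $path = isset($parsed['path']) ? $parsed['path'] : '/';
        }
        $urls[] = 'http://www.website.com/index.php?url=' . $path;
    }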
    

    To determine whether a URL is absolute, the code simply checks whether it begins with http:// or https://. If your links use other schemes, such as ftp:// or tel:, or protocol-relative URLs that begin with //, you will need to handle those as well.
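
    If other schemes do matter, the prefix check could be generalized along these lines. This is only a sketch: classify_href() is a hypothetical helper name, not part of the answer above, and treating protocol-relative (//host/...) links as absolute is an assumption about what you want.

```php
<?php
// Hypothetical helper (not from the original answer): classify an href value
// as 'absolute' or 'relative' without hard-coding http:// and https://.
function classify_href($href)
{
    $href = trim($href);

    // Assumption: protocol-relative URLs such as //cdn.example.com/app.js
    // name a host, so they are counted as absolute here.
    if (substr($href, 0, 2) === '//') {
        return 'absolute';
    }

    // A URI scheme is a letter followed by letters, digits, "+", "-" or ".",
    // terminated by ":". This covers http:, https:, ftp:, tel:, mailto:, ...
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href) === 1) {
        return 'absolute';
    }

    return 'relative';
}

// Example values only; none of these come from the question.
foreach (array('https://example.com/a', 'tel:+15550100', '/about', 'img/logo.png') as $href) {
    echo $href, ' => ', classify_href($href), "\n";
}
```

    Because the pattern is anchored at the start of the string, a relative link such as page.php?u=http://x is still reported as relative: the "?" ends the candidate scheme before any ":" is reached.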

    This solution does use regex to parse HTML, which is often frowned upon. To avoid that, you could switch to DOMDocument, but there's no need for the extra code if the regex isn't causing any issues.
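
    For comparison, a DOMDocument-based version of the extraction step might look roughly like this. It is a sketch rather than a drop-in replacement: extract_hrefs() and the $sample markup are made up for illustration, and it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper: return every href attribute value found in the markup,
// in document order, using the DOM extension instead of a regex.
function extract_hrefs($html)
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    $xpath = new DOMXPath($doc);
    // "//@href" selects the href attribute of any element (a, link, area, ...),
    // which mirrors what the regex approach matches.
    foreach ($xpath->query('//@href') as $attribute) {
        $hrefs[] = $attribute->value;
    }
    return $hrefs;
}

// Placeholder markup standing in for the cURL response body.
$sample = '<p><a href="/about">About</a> <a href=\'http://example.com/x\'>X</a> <a name="top">no href</a></p>';
print_r(extract_hrefs($sample));
```

    The returned values can then go through the same absolute/relative check as before. Unlike the regex, this also picks up unquoted attribute values and ignores href= text that only appears inside comments or scripts.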
