Scrape FULL image src with PHP

忘掉有多难 2021-01-01 07:12

I am trying to scrape img src's with PHP. I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the

2 Answers
  • 2021-01-01 07:51

    You don't need a regex... just some patience. I don't really want to write the code for you, but just check whether the src starts with http:// (or https://), and if not, you have three different cases.

    1. If it begins with a / then prepend http://domain.com
    2. If it begins with .. you'll have to split the full URL of the page and hack off one trailing directory for each leading .., then append what is left of the src
    3. Else (it begins with a letter), take the full URL of the page, strip it down to the last slash, then append the src.

    Or.... be lazy and steal this script

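    $url = "http://www.goat.com/money/dave.html";
    $rel = "../images/cheese.jpg";

    $com = InternetCombineUrl($url, $rel);

    //  Returns http://www.goat.com/images/cheese.jpg

    function InternetCombineUrl($absolute, $relative) {
        // A URL that already has a scheme is absolute: return it unchanged.
        $p = parse_url($relative);
        if (!empty($p["scheme"])) return $relative;

        // Read the base parts explicitly so that missing parts are never undefined.
        $base   = parse_url($absolute);
        $scheme = $base["scheme"] ?? "";
        $user   = $base["user"] ?? "";
        $pass   = $base["pass"] ?? "";
        $host   = $base["host"] ?? "";
        $port   = $base["port"] ?? "";
        $path   = $base["path"] ?? "/";

        // Protocol-relative (//host/path): only the scheme comes from the base.
        if (substr($relative, 0, 2) == '//') {
            return ($scheme ? "$scheme:" : "") . $relative;
        }

        if (substr($relative, 0, 1) == '/') {
            // Root-relative: the base path is ignored.
            $parts = explode("/", $relative);
        }
        else {
            // Drop the file name from the base path, then append the relative path.
            $dir = substr($path, 0, (int) strrpos($path, "/"));
            $parts = array_merge(explode("/", $dir), explode("/", $relative));
        }

        // Resolve "." and ".." with a stack so that repeated ".." segments work.
        $cparts = array();
        foreach ($parts as $part) {
            if ($part === '' || $part === '.') continue;
            if ($part === '..') {
                array_pop($cparts);
                continue;
            }
            $cparts[] = $part;
        }
        $path = implode("/", $cparts);

        $url = "";
        if ($scheme) {
            $url = "$scheme://";
        }
        if ($user) {
            $url .= "$user";
            if ($pass) {
                $url .= ":$pass";
            }
            $url .= "@";
        }
        if ($host) {
            $url .= $host . ($port ? ":$port" : "") . "/";
        }
        $url .= $path;
        return $url;
    }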
    

    From http://www.web-max.ca/PHP/misc_24.php

  • 2021-01-01 08:10

    Unless you have the URL of the page you started from (in which case you can resolve the value of the src attribute against it), it seems like all you're left with is a string.

    I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.
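
    If you do have the page URL, something along these lines might work. This is only a rough sketch: extractImageUrls and absolutizeSrc are names made up for illustration, the page URL is assumed to be absolute, and query strings, fragments and any <base href> tag in the page are ignored.

    ```php
    <?php
    // Hypothetical helper: resolve one src value against the URL of the page it came from.
    // Assumes $pageUrl is absolute (has a scheme and a host).
    function absolutizeSrc(string $pageUrl, string $src): string
    {
        // Already absolute (http:, https:, ...): nothing to do.
        if (is_string(parse_url($src, PHP_URL_SCHEME))) {
            return $src;
        }
        $base = parse_url($pageUrl);
        // Protocol-relative (//host/path): only the scheme comes from the page.
        if (strpos($src, '//') === 0) {
            return $base['scheme'] . ':' . $src;
        }
        $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
        // Root-relative srcs start at the site root, everything else at the page's directory.
        $path = $base['path'] ?? '/';
        $dir = strpos($src, '/') === 0 ? '' : substr($path, 0, (int) strrpos($path, '/'));

        // Resolve "." and ".." segments with a stack.
        $stack = [];
        foreach (explode('/', $dir . '/' . $src) as $segment) {
            if ($segment === '' || $segment === '.') {
                continue;
            }
            if ($segment === '..') {
                array_pop($stack);
                continue;
            }
            $stack[] = $segment;
        }
        return $origin . '/' . implode('/', $stack);
    }

    // Hypothetical helper: collect every img src in $html as an absolute URL.
    function extractImageUrls(string $html, string $pageUrl): array
    {
        $doc = new DOMDocument();
        // Scraped HTML is rarely valid, so keep libxml warnings out of the output.
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        $urls = [];
        foreach ($doc->getElementsByTagName('img') as $img) {
            $src = trim($img->getAttribute('src'));
            if ($src !== '') {
                $urls[] = absolutizeSrc($pageUrl, $src);
            }
        }
        return $urls;
    }

    $pageUrl = 'http://www.goat.com/money/dave.html';
    $html = '<p><img src="../images/cheese.jpg"><img src="/logo.png">'
        . '<img src="thumb.gif"><img src="https://cdn.example.com/a.png"></p>';

    $urls = extractImageUrls($html, $pageUrl);
    // $urls[0] is http://www.goat.com/images/cheese.jpg
    // $urls[1] is http://www.goat.com/logo.png
    // $urls[2] is http://www.goat.com/money/thumb.gif
    // $urls[3] is https://cdn.example.com/a.png
    ```

    Using DOMDocument rather than a regex to find the img tags is just one option; the point is that the page URL is the only extra piece of information needed to make each src reusable.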
