PHP Curl following redirects

[愿得一人] 2020-12-15 10:23

I'm trying to be a bit sneaky and, as part of a learning process, improve my page-scraping skills.

One thing I've come across that I have yet to be able to s

2 Answers
  • 2020-12-15 10:51

    If you can't use CURLOPT_FOLLOWLOCATION, I suggest you use a recursive method like this one:

    function getUrl($url, $count = 0) {

        // Give up after 5 redirects to avoid an infinite loop
        if ($count > 5) {
            return false;
        }

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, true);          // include the response headers in the output
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it

        $data = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        curl_close($ch);

        if (!$data) {
            return false;
        }

        // Split the raw response into the header block and the body
        $dataArray = explode("\r\n\r\n", $data, 2);

        if (count($dataArray) != 2) {
            return false;
        }

        list($header, $body) = $dataArray;
        if (in_array($httpCode, array(301, 302, 303, 307, 308))) {
            $matches = array();
            // Header names are case-insensitive; anchor to the start of a line
            if (preg_match('/^Location:(.*)$/mi', $header, $matches)) {
                return getUrl(trim($matches[1]), $count + 1);
            }
            // Redirect status without a Location header
            return false;
        }

        return $body;
    }
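One gap in the recursive approach is that a `Location` header may hold a relative URL, which cannot be passed straight back to cURL. The following is a minimal sketch of how the header could be extracted and resolved against the URL that was just requested. The function names `extractLocation` and `resolveRedirect` are hypothetical helpers, not part of the answer above, and the sketch assumes the base URL is absolute and does not normalise `../` segments.

```php
<?php
// Hypothetical helper: pull the Location value out of a raw header block.
// The header name is matched case-insensitively and only at the start of a line,
// so "Content-Location:" is not picked up by mistake.
function extractLocation(string $header): ?string
{
    if (preg_match('/^Location:[ \t]*(.+?)[ \t\r]*$/mi', $header, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: turn a possibly relative Location value into an
// absolute URL, using the (absolute) URL that produced the redirect as the base.
function resolveRedirect(string $base, string $location): string
{
    // Already absolute: use it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }

    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host']
            . (isset($p['port']) ? ':' . $p['port'] : '');

    // Protocol-relative, e.g. //cdn.example.org/y
    if (strncmp($location, '//', 2) === 0) {
        return $p['scheme'] . ':' . $location;
    }

    // Root-relative, e.g. /login
    if ($location[0] === '/') {
        return $origin . $location;
    }

    // Path-relative: replace the last segment of the base path.
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}
```

Inside `getUrl`, the recursive call could then become something like `getUrl(resolveRedirect($url, $location), $count + 1)`, where `$location` is the result of `extractLocation($header)`.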
    
  • 2020-12-15 11:07

    http://php.net/manual/en/ref.curl.php

    function get_final_url( $url, $timeout = 5 )
    {
        // Decode the URL and turn HTML-escaped ampersands back into plain ones
        $url = str_replace( "&amp;", "&", urldecode(trim($url)) );

        $cookie = tempnam( "/tmp", "CURLCOOKIE" );
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt( $ch, CURLOPT_ENCODING, "" );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
        curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
        $content = curl_exec( $ch );
        $response = curl_getinfo( $ch );
        curl_close( $ch );

        // Still on a redirect status (e.g. cURL stopped following): read the Location header directly
        if ( $response['http_code'] == 301 || $response['http_code'] == 302 )
        {
            ini_set( "user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
            $headers = get_headers( $response['url'] );

            if ( is_array( $headers ) )
            {
                foreach ( $headers as $value )
                {
                    if ( substr( strtolower($value), 0, 9 ) == "location:" )
                    {
                        return get_final_url( trim( substr( $value, 9 ) ), $timeout );
                    }
                }
            }
        }

        // Follow simple JavaScript redirects found in the page body
        if ( preg_match( "/window\.location\.replace\('(.*?)'\)/i", $content, $value ) ||
             preg_match( "/window\.location\=\"(.*?)\"/i", $content, $value )
        )
        {
            return get_final_url( $value[1], $timeout );
        }

        return $response['url'];
    }
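The JavaScript-redirect check at the end of this function can be isolated and tested without any network access. Below is a slightly more tolerant sketch of that step, accepting either quote style and the `window.location.href = ...` form. The function name `findJsRedirect` is a hypothetical helper introduced here for illustration; it only covers these simple literal-string patterns and will not catch redirects built dynamically in script.

```php
<?php
// Hypothetical helper: return the target of a simple JavaScript redirect
// found in $html, or null if none of the known patterns match.
function findJsRedirect(string $html): ?string
{
    $patterns = [
        // window.location.replace('...') or window.location.replace("...")
        '/window\.location\.replace\(\s*([\'"])(.*?)\1\s*\)/i',
        // window.location = "..." or window.location.href = '...'
        '/window\.location(?:\.href)?\s*=\s*([\'"])(.*?)\1/i',
    ];

    foreach ($patterns as $pattern) {
        // Group 1 is the opening quote, group 2 the URL up to the matching quote.
        if (preg_match($pattern, $html, $m)) {
            return $m[2];
        }
    }
    return null;
}
```

Because the capture is non-greedy and stops at the matching quote, a later quoted string on the same line does not end up inside the extracted URL.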
    