How to find whether a domain uses HTTP or HTTPS (with or without WWW) using PHP?

寵の児 posted on 2019-12-13 22:26:04

Question


I have a list of one million (1,000,000) domains.

+----+--------------+--------------------------+
| Id | Domain_Name  |       Correct_URL        |
+----+--------------+--------------------------+
|  1 | example1.com | http://www.example1.com  |
|  2 | example2.com | https://example2.com     |
|  3 | example3.com | https://www.example3.com |
|  4 | example4.com | http://example4.com      |
+----+--------------+--------------------------+
  • The Id and Domain_Name columns are filled.
  • The Correct_URL column is empty.

Question: I need to fill the Correct_URL column.

The problem I face is how to find the prefix that goes before the domain. It may be http://, http://www., https://, or https://www.

How do I correctly determine which of these four applies using PHP? Please note that I need to run the code against all 1,000,000 domains, so I am looking for the fastest way to check.


Answer 1:


You could use cURL:

$url_list = ['facebook.com','google.com'];

foreach($url_list as $url){

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // follow redirects so the effective URL is the final destination
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // return the response body instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_exec($ch);

    $real_url =  curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    echo $real_url;//add here your db commands
    curl_close($ch);

}

This takes some time because it follows every redirect through to the final URL. If you only want to check whether it is HTTP or HTTPS, you could try this:

$url_list = ['facebook.com','google.com'];

foreach($url_list as $url){

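    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // return the response body instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);

    // redirects are not followed here; this is the first redirect target (empty if there is none)
    $real_url =  curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    echo $real_url;//add here your db commands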

}
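
Once cURL reports a final URL, it still has to be mapped back to one of the four prefixes from the question. The following is a minimal sketch of that step, assuming the final URL stays on the same domain; `prefixFromEffectiveUrl` is a hypothetical helper name, not part of the answer above.

```php
<?php
/**
 * Hypothetical helper: map a final URL back onto one of the four prefixes.
 * Returns null when the URL is not absolute, is not HTTP(S), or points
 * at a host other than the domain or its "www." variant.
 */
function prefixFromEffectiveUrl(string $domain, string $effectiveUrl): ?string
{
    $scheme = parse_url($effectiveUrl, PHP_URL_SCHEME);
    $host   = parse_url($effectiveUrl, PHP_URL_HOST);

    // parse_url() yields null (or false) when a component is missing
    if (!is_string($scheme) || !is_string($host)) {
        return null;
    }

    $scheme = strtolower($scheme);
    $host   = strtolower($host);
    $domain = strtolower($domain);

    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }
    if ($host === $domain) {
        return $scheme . '://';
    }
    if ($host === 'www.' . $domain) {
        return $scheme . '://www.';
    }

    // redirected to an unrelated host; left for the caller to decide
    return null;
}
```

The returned prefix can be concatenated with Domain_Name to build the Correct_URL value. A null result means the redirect chain ended somewhere else, which probably deserves a manual look.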



Answer 2:


There isn't really any way other than making an HTTP request to each of the possibilities and seeing whether you get a response.

While you assert "It may either http:// or http://www. or https:// or https://www.", real-world domains may provide zero, some, or all of those (as well as various others), and they may respond to requests with OKs, redirects, authentication errors, etc.

HTTP and HTTPS are not attributes of a web application; they are communication protocols handled by the endpoint (the web server, or an application firewall, etc.).

As with any network communication, one must probe the host ("www" is the host in this case) and the port, most commonly (though not necessarily) 80 for HTTP and 443 for HTTPS. Probing means sending a request and then waiting to see whether a service is listening on the other side.
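
The probing described here could be sketched roughly as follows, assuming only the default ports 80 and 443 matter. `probeTargets` and `isPortOpen` are hypothetical helper names, and an open TCP port only shows that something is listening, not that it speaks valid HTTP or TLS.

```php
<?php
/**
 * Hypothetical helper: list the host/port pairs worth probing for a domain
 * (bare host and "www." host, each on ports 80 and 443).
 */
function probeTargets(string $domain): array
{
    $targets = [];
    foreach ([$domain, 'www.' . $domain] as $host) {
        foreach ([80, 443] as $port) {
            $targets[] = ['host' => $host, 'port' => $port];
        }
    }
    return $targets;
}

/**
 * Hypothetical helper: true if something accepts a TCP connection
 * on host:port within the timeout.
 */
function isPortOpen(string $host, int $port, float $timeoutSeconds = 3.0): bool
{
    $errno  = 0;
    $errstr = '';
    // @ suppresses the warning fsockopen() raises when the connection fails
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeoutSeconds);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}
```

Calling `isPortOpen` for each entry of `probeTargets('example.com')` gives a quick first filter before making full HTTP requests.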




Answer 3:


Given a known URL, you could call the http and/or https versions with get_headers; from there you can determine whether https is available, whether http redirects to https, and so on.

Details can be found here: http://php.net/manual/en/function.get-headers.php
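
As a rough illustration, the array returned by get_headers (in its default, non-associative form) can be scanned for status lines and Location headers. The sketch below assumes that form; `summarizeHeaders` is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: summarize the array returned by get_headers($url).
 * When redirects are followed, the array holds one status line per hop.
 */
function summarizeHeaders(array $headers): array
{
    $statuses  = [];
    $locations = [];

    foreach ($headers as $line) {
        if (!is_string($line)) {
            continue; // only the plain indexed format is handled here
        }
        if (preg_match('#^HTTP/\S+\s+(\d{3})#i', $line, $m)) {
            $statuses[] = (int) $m[1];
        } elseif (stripos($line, 'Location:') === 0) {
            $locations[] = trim(substr($line, strlen('Location:')));
        }
    }

    // does any hop send the client to an https:// URL?
    $toHttps = false;
    foreach ($locations as $location) {
        if (stripos($location, 'https://') === 0) {
            $toHttps = true;
            break;
        }
    }

    return [
        'first_status'       => $statuses[0] ?? null,
        'final_status'       => $statuses ? $statuses[count($statuses) - 1] : null,
        'redirects_to_https' => $toHttps,
    ];
}
```

Passing the result of get_headers('http://' . $domain) to this function (after checking that it is not false) shows whether the plain HTTP version answers directly or redirects to HTTPS.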




Answer 4:


I have had to build a similar system, in which we verify user-supplied URLs.

In the end, you need to set an order of priority. The recommended order is HTTPS over HTTP and with WWW over without, so you end up with a priority list like:

  • https://www.example.com
  • https://example.com
  • http://www.example.com
  • http://example.com

As everyone else has said, you will need to test these using cURL.

foreach($domainRows as $domainRow){
    $scheme_list = ['https://www.','https://', 'http://www.', 'http://'];
    $bestUrl = false;
    foreach($scheme_list as $scheme){

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $scheme.$domainRow['Domain_Name']);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
        $response = curl_exec($ch);

        // CURLINFO_EFFECTIVE_URL is set even when the request fails,
        // so check that the transfer itself succeeded instead
        if($response !== false){
            $bestUrl = $scheme.$domainRow['Domain_Name'];
            break;
        }
    }


    if($bestUrl){
        // you have the best URL to use as $bestUrl save it to your DB Row
    }else{
        // the site is not responding to any of the URLs; do you need to do something here?
    }

}

Or, based on Alexander Holman's answer (I had completely forgotten about get_headers), you can do:

foreach($domainRows as $domainRow){
    $scheme_list = ['https://www.','https://', 'http://www.', 'http://'];
    $bestUrl = false;
    foreach($scheme_list as $scheme){

        $res = get_headers($scheme.$domainRow['Domain_Name']);
        // if you want to allow redirects remove/alter this part as it blocks them.
        if($res && isset($res[0])){
            $statusParts = explode(" ", $res[0]);
            if($statusParts[1] == "200"){
                $bestUrl = $scheme.$domainRow['Domain_Name'];
                break;
            }
        }
        //end of status check
        //replace with below to allow all responses from server including 404
        /*if($res){
            $bestUrl = $scheme.$domainRow['Domain_Name'];
            break;
        }*/
    }


    if($bestUrl){
        // you have the best URL to use as $bestUrl save it to your DB Row
    }else{
        // the site is not responding to any of the URLs; do you need to do something here?
    }

}

This code tests the candidates in order of priority and stops at the first one that works; if none of them responds, it tells you that too.
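
Since the question asks for speed across 1,000,000 domains, one possible direction is to issue the four requests per domain concurrently with cURL's multi interface rather than one after another. The sketch below is only an outline under stated assumptions: `candidateUrls`, `probeBatch`, and `pickBest` are hypothetical names, redirects are not followed, and only 2xx responses count as a match (mirroring the "200" check above).

```php
<?php
/** Hypothetical helper: the four candidate URLs in priority order. */
function candidateUrls(string $domain): array
{
    $urls = [];
    foreach (['https://www.', 'https://', 'http://www.', 'http://'] as $prefix) {
        $urls[] = $prefix . $domain;
    }
    return $urls;
}

/**
 * Hypothetical helper: probe URLs concurrently with curl_multi.
 * Returns [url => HTTP status code]; 0 means no HTTP response arrived.
 */
function probeBatch(array $urls, int $timeoutSeconds = 5): array
{
    $multi   = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,  // HEAD-style request, no body
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => false, // assumption: judge each URL on its own
            CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
            CURLOPT_TIMEOUT        => $timeoutSeconds,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // drive all transfers until none is still running
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}

/** Hypothetical helper: first candidate (in priority order) with a 2xx status. */
function pickBest(array $candidates, array $statusByUrl): ?string
{
    foreach ($candidates as $url) {
        $code = $statusByUrl[$url] ?? 0;
        if ($code >= 200 && $code < 300) {
            return $url;
        }
    }
    return null;
}
```

For a single domain, this would be used as `pickBest(candidateUrls($d), probeBatch(candidateUrls($d)))`. Batching the candidates of several domains into one `probeBatch` call may raise throughput further, at the cost of more open connections at once.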

With thanks to Supun Praneeth, as I have taken and augmented their code to better suit your needs.



Source: https://stackoverflow.com/questions/50873547/how-to-find-the-domain-is-whether-http-or-https-with-or-without-www-using-php
