MySQL query to extract domains from URLs

小鲜肉 2020-12-08 08:21

Sorry for my English.

I have this query to extract the domain from URLs:

SELECT SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8) AS domain...


        
12 answers
  • 2020-12-08 08:39

    None of the other answers seem to work for me. For example, a URL that has query variables starting with a ? makes some of the answers fail. This works for me for all kinds of URLs:

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(REPLACE(LOWER(url), 'https://', ''), 'http://', ''), '/', 1), '?', 1) AS domain;
    
  • 2020-12-08 08:42

    I combined several examples from this page (and some documentation) into the following annotated version. It appears impossible with SUBSTRING_INDEX to remove subdomains without also breaking *.co.uk type domains.

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(url, 
    '?', 1), # split on the query string first to remove the weirdest stuff
    '://', -1), # remove protocol http:// https:// ftp:// ...
    '/', 1), # split on path
    ':', 2), # with user:pass@host:port, drop the trailing :port
    '@', -1), # drop user:pass@ and keep the host
    ':', 1), # split on port
    'www.', -1), # remove www.
    '.', 4), # keep at most the first four dot-separated labels
    '/', 1) 
    AS domain
    FROM ( 
        SELECT       'http://test.com' as url 
        UNION SELECT 'https://test.com' 
        UNION SELECT 'http://test.com/one' 
        UNION SELECT 'http://test.com/?huh' 
        UNION SELECT 'http://www.test1.test.com?http://ouch.foo' 
        UNION SELECT 'test.com' 
        UNION SELECT 'test.com/one'
        UNION SELECT 'test.com/one/two'
        UNION SELECT 'test.com/one/two/three'
        UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
        UNION SELECT 'http://one.test.com'
        UNION SELECT 'one.test.com/one'
        UNION SELECT 'https://www.bbc.co.uk/'
        UNION SELECT 'http://a.very.complex-domain.co.uk:8080/foo/bar'
        UNION SELECT 'postgres://user:pass@host.com:5432/path?k=v#f'
        UNION SELECT 'http://10.64.3.5/data_check/index.php?r=index/rawdatacheck'
        UNION SELECT 'two.one.test.com/one' ) AS test; 
    
  • 2020-12-08 08:47

    If you want to find the first occurrence of /, why are you passing 8 into LOCATE as the starting position? (I assume it is to skip the slashes after the protocol, such as http://, but if you also expect URLs without a protocol, consider short ones like cnn.com/page.)

    IF(LOCATE('/', url) > 0, SUBSTRING(url, 1, LOCATE('/', url) - 1), url) AS domain
    

    Alternatively:

    SUBSTRING_INDEX(url, '/', 1) AS domain
    

    The latter seems easier to me.
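
    To make the effect of the hard-coded start position 8 concrete, here is a minimal Python sketch of the expression from the question next to the single-split alternative. The helpers `locate`, `left`, `substring`, `question_query`, and `first_segment` are hypothetical stand-ins that only mimic MySQL's 1-based LOCATE, LEFT, SUBSTRING, and SUBSTRING_INDEX for these inputs; they are not part of any library.

```python
def locate(needle, s, pos=1):
    """Mimic MySQL LOCATE: 1-based index of needle at or after pos, 0 if absent."""
    return s.find(needle, pos - 1) + 1


def left(s, n):
    """Mimic MySQL LEFT: first n characters (empty when n <= 0)."""
    return s[:n] if n > 0 else ""


def substring(s, pos):
    """Mimic MySQL SUBSTRING(s, pos) for pos >= 1: text from 1-based pos onward."""
    return s[pos - 1:]


def question_query(url):
    # SUBSTRING(LEFT(url, LOCATE('/', url, 8) - 1), 8)
    return substring(left(url, locate("/", url, 8) - 1), 8)


def first_segment(url):
    # SUBSTRING_INDEX(url, '/', 1): everything before the first '/'
    return url.split("/")[0]


for url in ["http://cnn.com/page", "https://cnn.com/page", "cnn.com/page"]:
    print(repr(url), "->", repr(question_query(url)), "|", repr(first_segment(url)))
```

    In this sketch the question's expression only yields cnn.com for the http:// input; for https:// and for the protocol-less URL the 8th character is itself a /, so the result is empty. The single-split version handles cnn.com/page but returns http: or https: when a protocol is present, so it fits data that has no protocol.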

  • 2020-12-08 08:47

    This one works well on my very messy data:

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', '.'), '/', 1), '.', -2) AS Domain
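
    A rough Python trace of this expression shows what "the last two labels" means in practice. `substring_index` and `last_two_labels` are hypothetical helpers written for this sketch; `substring_index` only approximates MySQL's SUBSTRING_INDEX for the simple, non-overlapping delimiters used here.

```python
def substring_index(s, delim, count):
    """Approximate MySQL SUBSTRING_INDEX for non-overlapping delimiters."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # Positive count keeps the leftmost pieces, negative count the rightmost.
    kept = parts[:count] if count > 0 else parts[count:]
    return delim.join(kept)


def last_two_labels(url):
    # SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(url, '//', '.'), '/', 1), '.', -2)
    host = substring_index(url.replace("//", "."), "/", 1)
    return substring_index(host, ".", -2)


for url in ["http://test.com/one", "one.test.com/one", "https://www.bbc.co.uk/"]:
    print(url, "->", last_two_labels(url))
```

    Under these assumptions the first two inputs give test.com, while https://www.bbc.co.uk/ gives co.uk, so data with two-part suffixes of that kind would need a different approach.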
    
  • 2020-12-08 08:47

    If you're not afraid of installing MySQL extensions (UDFs), there's a UDF that does exactly this while respecting different top-level domains, such as "google.com" and "google.co.uk":

    https://github.com/StirlingMarketingGroup/mysql-get-etld-p1

    select `get_etld_p1`('http://a.very.complex-domain.co.uk:8080/foo/bar'); -- 'complex-domain.co.uk'
    select `get_etld_p1`('https://www.bbc.co.uk/'); -- 'bbc.co.uk'
    select `get_etld_p1`('https://github.com/StirlingMarketingGroup/'); -- 'github.com'
    select `get_etld_p1`('https://localhost:10000/index'); -- 'localhost'
    select `get_etld_p1`('android-app://com.google.android.gm'); -- 'com.google.android.gm'
    select `get_etld_p1`('example.test.domain.com'); -- 'domain.com'
    select `get_etld_p1`('postgres://user:pass@host.com:5432/path?k=v#f'); -- 'host.com'
    select `get_etld_p1`('exzvk.omsk.so-ups.ru'); -- 'so-ups.ru'
    select `get_etld_p1`('http://10.64.3.5/data_check/index.php?r=index/rawdatacheck'); -- '10.64.3.5'
    select `get_etld_p1`('not a domain'); -- null
    
  • 2020-12-08 08:50

    I had to combine some of the previous answers, plus a little more hackery for my data set. This is what works for me; it returns the domain and any subdomains:

    SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
    

    Explanation (because non-trivial SQL rarely makes sense):

    SUBSTRING_INDEX(target_url, '/', 3) - strips any path if the URL has a protocol
    SUBSTRING_INDEX(THAT, '://', -1) - strips any protocol from THAT
    SUBSTRING_INDEX(THAT, '/', 1) - strips any path from THAT (if there was no protocol)
    SUBSTRING_INDEX(THAT, '?', 1) - strips the query string from THAT (if there was no path or trailing /)
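
    The four steps can also be traced outside the database. The following Python sketch is only an illustration: `substring_index` and `domain_steps` are hypothetical helpers, and `substring_index` approximates MySQL's SUBSTRING_INDEX for the non-overlapping delimiters used here.

```python
def substring_index(s, delim, count):
    """Approximate MySQL SUBSTRING_INDEX for non-overlapping delimiters."""
    if count == 0:
        return ""
    parts = s.split(delim)
    # Positive count keeps the leftmost pieces, negative count the rightmost.
    kept = parts[:count] if count > 0 else parts[count:]
    return delim.join(kept)


def domain_steps(target_url):
    """Return the intermediate value after each of the four nested calls."""
    step1 = substring_index(target_url, "/", 3)  # cut at the third '/'
    step2 = substring_index(step1, "://", -1)    # drop the protocol, if any
    step3 = substring_index(step2, "/", 1)       # drop a path left over without protocol
    step4 = substring_index(step3, "?", 1)       # drop a query string
    return [step1, step2, step3, step4]


for url in ["http://test.com?http://ouch.foo",
            "test.com/one/two/three?u=http://maaaaannn"]:
    print(url)
    for value in domain_steps(url):
        print("   ", value)
```

    For the first URL the query string survives until the last step, and for the second one the path is removed only by the third step, which is why all four calls are needed.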

    Test Cases:

    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(target_url, '/', 3), '://', -1), '/', 1), '?', 1) AS domain
    FROM ( 
        SELECT       'http://test.com' as target_url 
        UNION SELECT 'https://test.com' 
        UNION SELECT 'http://test.com/one' 
        UNION SELECT 'http://test.com/?huh' 
        UNION SELECT 'http://test.com?http://ouch.foo' 
        UNION SELECT 'test.com' 
        UNION SELECT 'test.com/one'
        UNION SELECT 'test.com/one/two'
        UNION SELECT 'test.com/one/two/three'
        UNION SELECT 'test.com/one/two/three?u=http://maaaaannn'
        UNION SELECT 'http://one.test.com'
        UNION SELECT 'one.test.com/one'
        UNION SELECT 'two.one.test.com/one' ) AS Test; 
    

    Results:

    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'test.com'
    'one.test.com'
    'one.test.com'
    'two.one.test.com'
    