Extracting top-level and second-level domain from a URL using regex

误落风尘 2020-12-05 08:02

How can I extract only the top-level and second-level domains from a URL using a regex? I want to skip all lower-level domains. Any ideas?

9 answers
  • 2020-12-05 08:46

    You can likely do that with an expression similar to:

    ^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
    

    and add as many capturing groups as you need to capture the components of a URL.

    The expression is explained in the top-right panel of regex101.com, where you can also simplify, modify, or explore it and see how it matches against some sample inputs.
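
    As a rough illustration, here is a minimal JavaScript sketch of how the expression above could be applied. The helper name `extractDomain` and the sample URLs are assumptions made for this example, not part of the original answer; capture group 2 is taken as the "second-level plus top-level" part.

    ```javascript
    // The expression from this answer, unchanged.
    const DOMAIN_RE = /^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$/;

    // Hypothetical helper: returns capture group 2, or null when nothing matches.
    function extractDomain(url) {
      const match = DOMAIN_RE.exec(url);
      return match ? match[2] : null;
    }

    console.log(extractDomain('https://www.sub.example.com/path')); // 'example.com'
    console.log(extractDomain('https://example.com/'));             // null
    ```

    Note that the expression requires a scheme and at least two dots in the host, so a bare `https://example.com/` does not match. With several subdomain levels the lazy `.*?` can also make group 2 land on earlier labels than intended, so it is worth testing against your own inputs.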


    RegEx Circuit

    jex.im visualizes regular expressions.

  • 2020-12-05 08:49

    For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:

    'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
    

    This matches every period that is followed by two or three word characters and then a word boundary; joining the matches gives the result.

    Here are some example outputs:

    'example.aus.com'       // .aus.com
    'example.austin.com'    // .com ('.austin' is longer than three characters, so it is skipped)
    'example.aus.com/howdy' // .aus.com
    'example.co.uk/howdy'   // .co.uk
    

    Some people might need something a bit cleverer, but this was enough for me with my particular dataset.

    Edit

    I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:

    'example.aus.com'.match(/\.\w*\b/g).join('')
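
    One thing to watch with this one-liner is that `match` returns `null` when nothing matches, so `.join` would throw. A minimal sketch with a guard follows; the helper name `dottedLabels` and the sample inputs are made up for illustration.

    ```javascript
    // Hypothetical helper wrapping the pattern from the edit above.
    function dottedLabels(host) {
      // match() returns null when there is no match, so fall back to an empty array.
      const parts = host.match(/\.\w*\b/g) || [];
      return parts.join('');
    }

    console.log(dottedLabels('example.aus.com'));     // '.aus.com'
    console.log(dottedLabels('www.example.aus.com')); // '.example.aus.com'
    console.log(dottedLabels('localhost'));           // ''
    ```

    As the second call shows, the pattern keeps every label after the first dot, so it only yields the top two levels when the input has exactly one label in front of them.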
    
  • 2020-12-05 08:51

    Updated 2019

    This is an old question, and the problem has become much more complicated as new vanity TLDs and more ccTLD second-level domains (e.g. .co.uk, .org.uk) keep being added, to the point that a regular expression is almost guaranteed to return false positives or negatives.

    The only way to reliably get the primary host is to consult a data source that knows about these suffixes, like the Public Suffix List.

    There are several open-source libraries out there that you can use, like psl, or you can write your own.

    Usage for psl is quite intuitive. From their docs:

    var psl = require('psl');
    
    // Parse domain without subdomain
    var parsed = psl.parse('google.com');
    console.log(parsed.tld); // 'com'
    console.log(parsed.sld); // 'google'
    console.log(parsed.domain); // 'google.com'
    console.log(parsed.subdomain); // null
    
    // Parse domain with subdomain
    var parsed = psl.parse('www.google.com');
    console.log(parsed.tld); // 'com'
    console.log(parsed.sld); // 'google'
    console.log(parsed.domain); // 'google.com'
    console.log(parsed.subdomain); // 'www'
    
    // Parse domain with nested subdomains
    var parsed = psl.parse('a.b.c.d.foo.com');
    console.log(parsed.tld); // 'com'
    console.log(parsed.sld); // 'foo'
    console.log(parsed.domain); // 'foo.com'
    console.log(parsed.subdomain); // 'a.b.c.d'
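
    To make the idea behind a suffix list concrete without depending on the library, here is a minimal self-contained sketch. The `SAMPLE_SUFFIXES` set is a tiny hand-picked sample, not the real Public Suffix List, and `registrableDomain` is a hypothetical helper name; the real list has thousands of entries plus wildcard and exception rules that this sketch ignores.

    ```javascript
    // Tiny hand-picked sample for illustration only; NOT the real Public Suffix List.
    const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk', 'org.uk']);

    // Hypothetical helper: returns the suffix plus one more label, or null.
    // new URL() throws on strings that are not valid absolute URLs.
    function registrableDomain(url) {
      const labels = new URL(url).hostname.toLowerCase().split('.');
      // Try candidate suffixes from longest to shortest, so 'co.uk' wins over 'uk'.
      for (let i = 0; i < labels.length; i++) {
        const suffix = labels.slice(i).join('.');
        if (SAMPLE_SUFFIXES.has(suffix)) {
          // If the whole hostname is itself a suffix, there is nothing to return.
          return i === 0 ? null : labels.slice(i - 1).join('.');
        }
      }
      return null; // no known suffix in the sample set
    }

    console.log(registrableDomain('https://a.b.example.co.uk/x')); // 'example.co.uk'
    console.log(registrableDomain('http://www.google.com/'));      // 'google.com'
    ```

    The longest-suffix-first lookup is what a single regular expression cannot express without embedding the list itself, which is why a maintained library is the safer choice in practice.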
    

    Old answer

    You could use this:

    (\w+\.\w+)$
    

    Without more details (a sample file, the language you're using), it's hard to say whether this will work for you.

    Example: http://regex101.com/r/wD8eP2
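
    For a quick feel of how the old pattern behaves, here is a small JavaScript sketch. The helper name `lastTwoLabels` and the inputs are assumptions chosen for illustration.

    ```javascript
    // The pattern from the old answer: the last two dot-separated word runs.
    const LAST_TWO_LABELS = /(\w+\.\w+)$/;

    // Hypothetical helper: returns capture group 1, or null when nothing matches.
    function lastTwoLabels(input) {
      const match = LAST_TWO_LABELS.exec(input);
      return match ? match[1] : null;
    }

    console.log(lastTwoLabels('a.b.example.com'));        // 'example.com'
    console.log(lastTwoLabels('www.example.co.uk'));      // 'co.uk'
    console.log(lastTwoLabels('example.com/index.html')); // 'index.html'
    ```

    The last two calls show the false positives mentioned in the update: a two-part suffix such as .co.uk is returned on its own, and a trailing path is mistaken for a domain, so the input would need to be a bare hostname with a single-label TLD.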
