Extracting top-level and second-level domain from a URL using regex


How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?

9 answers

小蘑菇

2020-12-05 08:46
Also, you can likely do that with some expression similar to,
```
^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$
```
and add as many capturing groups as you need to capture the other components of a URL.
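As a rough sketch of how the expression above might be applied in JavaScript: the helper name `secondGroup` and the sample URLs are assumptions made for illustration, not part of the answer. Capture group 2 holds the domain part. The pattern appears to expect at least one label in front of that domain, so a bare `https://example.com` does not match.

```javascript
// The expression from the answer, written as a JavaScript regex literal.
const DOMAIN_RE = /^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$/;

// Hypothetical helper (not from the answer): returns capture group 2,
// or null when the URL does not match at all.
function secondGroup(url) {
  const m = DOMAIN_RE.exec(url);
  return m ? m[2] : null;
}

console.log(secondGroup('https://www.example.com/path'));    // example.com
console.log(secondGroup('https://blog.example.co.uk/post')); // example.co.uk
console.log(secondGroup('https://example.com'));             // null (no label before the domain)
```

The `{2,6}` quantifiers also cap the length of the last labels, so it is worth testing the pattern against your own data before relying on it.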

Demo

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. There you can also see how it matches against some sample inputs.

RegEx Circuit

jex.im visualizes regular expressions.
长情又很酷

2020-12-05 08:49
For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:
```
'example.aus.com'.match(/\.\w{2,3}\b/g).join('')
```
This matches anything with a period followed by two or three characters and then a word boundary.

Here are some example outputs:
```
'example.aus.com'       // .aus.com
'example.austin.com'    // .com (".austin" is longer than three characters, so it is skipped)
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy'   // .co.uk
```
Some people might need something a bit cleverer, but this was enough for me with my particular dataset.

Edit

I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:
```
'example.aus.com'.match(/\.\w*\b/g).join('')
```
野趣味

2020-12-05 08:51
Updated 2019

This is an old question, and the problem has become much more complicated as new vanity TLDs and more ccTLD second-level domains (e.g. .co.uk, .org.uk) are added. So much so that a regular expression is almost guaranteed to return false positives or negatives.

The only way to reliably get the primary host is to consult a data source that knows about these suffixes, such as the Public Suffix List.

There are several open-source libraries out there that you can use, like psl, or you can write your own.

Usage for psl is quite intuitive. From their docs:
```
var psl = require('psl');

// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null

// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'

// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
```
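For the "write your own" route mentioned above, a minimal sketch of the lookup idea might look like the following. It is not how psl is implemented: `SAMPLE_SUFFIXES` is a tiny hand-picked stand-in for the real Public Suffix List, `registrableDomain` is a hypothetical name, and the list's wildcard and exception rules are ignored entirely.

```javascript
// Illustrative subset only; a real implementation would load the full Public Suffix List.
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk', 'org.uk']);

// Hypothetical helper: returns the public suffix plus one label, or null if none is found.
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Walk from the longest candidate suffix to the shortest so 'co.uk' wins over 'uk'.
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.has(suffix)) {
      // i === 0 means the whole hostname is itself a suffix, so there is no domain to return.
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

// The built-in URL class strips the scheme and path before the lookup.
const host = new URL('https://a.b.example.co.uk/path').hostname;
console.log(registrableDomain(host));             // example.co.uk
console.log(registrableDomain('www.google.com')); // google.com
console.log(registrableDomain('co.uk'));          // null
```

Trying the longest suffix first is what lets `.co.uk` be treated as a unit, which is the case plain regular expressions tend to get wrong.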
Old answer

You could use this:
```
(\w+\.\w+)$
```
Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.

Example: http://regex101.com/r/wD8eP2
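To make the behaviour of the old expression concrete, here is a small JavaScript sketch. The helper name `lastTwoLabels` and the inputs are assumptions for illustration. It simply returns whatever the capture group holds.

```javascript
// The expression from the old answer.
const LAST_TWO_LABELS = /(\w+\.\w+)$/;

// Hypothetical helper: returns the captured text, or null when the pattern does not match.
function lastTwoLabels(host) {
  const m = LAST_TWO_LABELS.exec(host);
  return m ? m[1] : null;
}

console.log(lastTwoLabels('www.google.com'));    // google.com
console.log(lastTwoLabels('www.example.co.uk')); // co.uk (not example.co.uk)
console.log(lastTwoLabels('example.com/path'));  // null, because of the trailing path
```

The second call is the ccTLD case described in the update above, and the third shows that the input has to be a bare hostname for the `$` anchor to work.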