JavaScript/Regex for finding just the root domain name without subdomains

Posted by 此生再无相见时 on 2019-12-19 05:16:42

Question


I searched and found lots of similar regex examples, but none that do quite what I need.

I want to be able to pass in the following URLs and get back these results:

  • www.google.com returns google.com

  • sub.domains.are.cool.google.com returns google.com

  • doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com

  • sub.domain.google.com/no/thanks returns google.com

Hope that makes sense :) Thanks in advance! -James


Answer 1:


You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk, though, you would end up with co.uk, which is only the suffix.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
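
As a rough sketch of the lookup idea: the function below looks for the longest suffix of the host name that appears in a suffix set and keeps one extra label in front of it. The set `SAMPLE_SUFFIXES` and the function name `rootDomain` are made up for illustration, and the three entries are only a tiny hand-picked sample. A real implementation would load the full list from publicsuffix.org, which also contains wildcard and exception rules that this sketch ignores.

```javascript
// Hypothetical, hand-picked sample; NOT the real public suffix list.
const SAMPLE_SUFFIXES = new Set(['com', 'uk', 'co.uk']);

// Returns the root domain of a bare host name (no scheme, no path),
// or null if the host is itself a suffix or no sample suffix matches.
function rootDomain(host) {
  const name = host.toLowerCase();
  if (SAMPLE_SUFFIXES.has(name)) {
    return null; // e.g. "co.uk" has no label in front of the suffix
  }
  const labels = name.split('.');
  // i starts at 1 so at least one label stays in front of the candidate suffix.
  // Smaller i means a longer suffix, so the longest match wins.
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.has(suffix)) {
      // Keep the suffix plus the single label in front of it.
      return labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(rootDomain('subdomain.google.com'));   // google.com
console.log(rootDomain('subdomain.google.co.uk')); // google.co.uk
```

Because the longest suffix is tried first, co.uk is preferred over uk, which is what keeps google.co.uk intact.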




Answer 2:


Don't use a regex; use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
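
Put together, the split approach might look like the sketch below. The function name `lastSegments` and the `TWO_PART_TLDS` list are assumptions for illustration; the list only needs to cover the suffixes your own data actually contains. The scheme and path are cut off first so that input such as sub.domain.google.com/no/thanks also works.

```javascript
// Hypothetical list of two-part suffixes this narrow use case cares about.
const TWO_PART_TLDS = ['co.uk', 'com.au'];

function lastSegments(input) {
  // Drop an optional scheme, then anything after the host (path, query, fragment).
  const host = input.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '').split(/[\/?#]/)[0];
  const s = host.split('.');
  // Take 3 segments when the host ends in a known two-part suffix, otherwise 2.
  const count = TWO_PART_TLDS.includes(s.slice(-2).join('.')) ? 3 : 2;
  return s.slice(-count).join('.');
}

console.log(lastSegments('www.google.com'));                  // google.com
console.log(lastSegments('sub.domain.google.com/no/thanks')); // google.com
console.log(lastSegments('www.google.co.uk'));                // google.co.uk
```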




Answer 3:


I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

  1. three or more alpha characters (e.g. com/net/mil/coop, etc.)
  2. two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
  3. two alpha characters (e.g. us/uk/to, etc.)

and at the end of that, a word boundary (\b), meaning the end of the string, a space, or any other non-word character (in regex, word characters are typically alphanumerics and the underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
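
One thing worth checking before using the pattern from JavaScript: run unanchored against www.google.com, the first match it finds is www.google, not google.com. The sketch below therefore assumes the host is isolated first and anchors the pattern to the end of the host with `$`. That anchoring, the helper name `rootByRegex`, and the path-stripping step are additions for illustration, not part of the answer's pattern.

```javascript
// The answer's pattern (character classes written as [A-Za-z]) with the
// trailing \b swapped for $, which is an assumption of this sketch.
const ROOT_PATTERN = /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))$/;

function rootByRegex(input) {
  // Keep only the part before the first slash, i.e. the host.
  const host = input.split('/')[0];
  const match = ROOT_PATTERN.exec(host);
  return match ? match[1] : null;
}

console.log(rootByRegex('www.google.com'));                  // google.com
console.log(rootByRegex('sub.domain.google.com/no/thanks')); // google.com
console.log(rootByRegex('subdomain.google.co.uk'));          // google.co.uk
console.log(rootByRegex('www.example.com.au'));              // com.au (a miss)
```

The last line shows the kind of case the answer warns about: a three-letter-plus-two-letter suffix such as com.au matches none of the three alternatives as a whole, so the pattern settles for com.au.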




Answer 4:


If you have a limited subset of data, I suggest keeping the regex simple, e.g.

(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))

This will match:

www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com

In my case, I know that all relevant URLs will be matched using this regex.

Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
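
A test script for this could be as small as the sketch below, which runs the pattern (with the dot in co.uk escaped) over a table of sample inputs and expected results. The `SAMPLES` table reuses the three cases listed above; the names `DOMAIN_PATTERN`, `extract`, and `failures` are made up for illustration, and a real project would more likely use its test framework of choice.

```javascript
const DOMAIN_PATTERN = /(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))/;

// Each entry is [input, expected root domain].
const SAMPLES = [
  ['www.google.com', 'google.com'],
  ['www.google.co.uk', 'google.co.uk'],
  ['www.foo-bar.com', 'foo-bar.com'],
];

function extract(url) {
  const match = DOMAIN_PATTERN.exec(url);
  return match ? match[1] : null;
}

// Collect every sample whose extracted value differs from the expected one.
const failures = SAMPLES.filter(([url, expected]) => extract(url) !== expected);

for (const [url, expected] of failures) {
  console.error(`FAIL ${url}: expected ${expected}, got ${extract(url)}`);
}
console.log(`${SAMPLES.length - failures.length}/${SAMPLES.length} samples passed`);
```

Adding a new row to `SAMPLES` whenever an unexpected URL turns up keeps the dataset and the pattern in step.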




Answer 5:


Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:

domain = 'sub.domains.are.cool.google.com'

s = domain.split('.')

tld = s.slice(-2).join('.')


Source: https://stackoverflow.com/questions/3439863/javascript-regex-for-finding-just-the-root-domain-name-without-sub-domains
