How to extract top-level domain name (TLD) from URL

Anonymous (unverified), submitted 2019-12-03 01:48:02

Question:

How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:]) 

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
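To make the failure concrete, here is the same one-liner wrapped in a function and run on both URLs. It is only a sketch written for Python 3, where urlparse lives in urllib.parse; the function name last_two_labels is made up for this illustration.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Naive approach: keep only the last two dot-separated labels of the host."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])


print(last_two_labels("http://www.foo.com"))     # foo.com
print(last_two_labels("http://www.foo.com.au"))  # com.au, not the wanted foo.com.au
```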

thanks

Answer 1:

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source that source will also change accordingly, one hopes!-).
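A minimal sketch of the auxiliary-table idea, in Python 3. The table below is hand-written for illustration and is neither complete nor guaranteed current; the names MULTI_LABEL_SUFFIXES and registrable_domain are hypothetical. A real table would come from a maintained source.

```python
from urllib.parse import urlparse

# Illustrative, hand-written table: NOT a complete or current list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "net.au"}


def registrable_domain(url, suffixes=MULTI_LABEL_SUFFIXES):
    """Return the domain without subdomains, consulting the suffix table."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # If the last two labels form a known two-label suffix, keep three labels;
    # otherwise fall back to keeping the last two.
    keep = 3 if ".".join(labels[-2:]) in suffixes else 2
    return ".".join(labels[-keep:])


print(registrable_domain("http://www.foo.com"))     # foo.com
print(registrable_domain("http://www.foo.com.au"))  # foo.com.au
```

Anything missing from the table silently falls back to the two-label guess, which is why the quality of the table matters more than the code.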



Answer 2:

Using this file of effective TLDs, which someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

results in:

abcde.co.uk 

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
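One possible tidy-up, offered as a sketch rather than a definitive answer: store the rules in a set for fast membership tests, iterate with non-negative indices from the longest candidate to the shortest, and use hostname so that ports and letter case do not interfere. It targets Python 3. The helper load_rules and the inline SAMPLE_RULES are assumptions made so the example runs without the real file; in practice the lines would come from the downloaded list.

```python
from urllib.parse import urlparse


def load_rules(lines):
    """Collect suffix rules, skipping blank lines and // comments."""
    rules = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules


def get_domain(url, rules):
    """Return the registrable domain of url according to the given rules."""
    labels = (urlparse(url).hostname or "").split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is the registrable domain.
            return candidate
        if candidate in rules or wildcard in rules:
            # Suffix matched: the registrable domain is one label longer.
            return ".".join(labels[max(i - 1, 0):])
    raise ValueError("Domain not in the supplied list of suffixes")


# Illustrative subset written in the same rule syntax as the file.
SAMPLE_RULES = """
// sample rules only
com
uk
co.uk
*.ck
!www.ck
""".splitlines()

rules = load_rules(SAMPLE_RULES)
print(get_domain("http://abcde.co.uk", rules))  # abcde.co.uk
print(get_domain("http://www.foo.com", rules))  # foo.com
```

ValueError seems a reasonable choice here, since the argument has the right type but an unusable value.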



Answer 3:

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.



Answer 4:

Using the Python tld package:

https://pypi.python.org/pypi/tld

$ pip install tld

from tld import get_tld
print get_tld("http://www.google.co.uk")

google.co.uk

or without protocol:

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

google.co.uk



Answer 5:

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/
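As a rough sketch of how such a list could be used from Python 3, the snippet below parses text in the style of the first file (one upper-case TLD per line, with a leading # comment) and checks the last label of a host name against it. SAMPLE_LIST is a tiny illustrative stand-in for the downloaded file, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Tiny stand-in for the downloaded file; the real list is far longer.
SAMPLE_LIST = """\
# Sample in the style of tlds-alpha-by-domain.txt (illustrative subset)
AU
COM
IT
UK
"""


def parse_tld_list(text):
    """Return the TLDs in lower case, skipping blank lines and # comments."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def has_known_tld(url, tlds):
    """Check whether the last label of the URL's host is a listed TLD."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in tlds


tlds = parse_tld_list(SAMPLE_LIST)
print(has_known_tld("http://www.foo.com.au", tlds))
print(has_known_tld("http://foo.invalid", tlds))
```

Note that a plain TLD list only validates the final label. It does not say that com.au acts as a suffix, so on its own it cannot separate foo.com.au from its subdomains.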



Answer 6:

Here's how I handle it:

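import re
import sys
import urlparse

# `url` is assumed to be defined earlier (a URL or a bare host name)
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]
# keep only the last two labels of the host
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)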


Answer 7:

Until get_tld is updated for all the new ones, I pull the tld from the error. Sure it's bad code but it works.

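# Needs `import re` and `from tld import get_tld` at module level.
def get_tld(self):
  # Method of a class that has a content_url attribute; the call below
  # resolves to the module-level get_tld from the tld package, not this method.
  try:
    return get_tld(self.content_url)
  except Exception as e:
    # Pull the unmatched domain out of the library's error message
    re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
    matchObj = re_domain.findall(str(e))
    if matchObj:
      for m in matchObj:
        return m
    raise e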

