Question:
How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?
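To make the failure concrete, here is the same idea as a small Python 3 sketch. It assumes `urllib.parse` in place of Python 2's `urlparse` module, and the function name `naive_domain` is only illustrative:

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def naive_domain(url):
    """Keep only the last two dot-separated labels of the host."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])


print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not foo.com.au
```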
Thanks.
Answer 1:
No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
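As a rough illustration of the auxiliary-table idea, the Python 3 sketch below keeps a tiny hand-written set of two-label suffixes. The set `MULTI_LABEL_SUFFIXES` and the function `registered_domain` are hypothetical names, the table is deliberately incomplete, and the URL is assumed to include a scheme such as http://:

```python
from urllib.parse import urlparse

# Hypothetical, hand-maintained table of two-label suffixes under which
# names are registered. A real table would be far longer and would need
# regular updates.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au"}


def registered_domain(url):
    """Return the registrable domain of url using the table above."""
    labels = urlparse(url).hostname.split(".")
    # Keep three labels when the last two form a known suffix, else two.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


print(registered_domain("http://www.foo.com"))     # foo.com
print(registered_domain("http://www.foo.com.au"))  # foo.com.au
```

Any suffix missing from the table silently falls back to the two-label guess, which is exactly why the table has to come from a maintained source.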
Answer 2:
Using this file of effective tlds which someone else found on Mozilla's website:
from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)
results in:
abcde.co.uk
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
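One possible tidier variant, offered only as a sketch: it targets Python 3 (so `urllib.parse` rather than `urlparse`), stores the rules in a set for fast membership tests, and walks the labels with a forward index instead of negative slices. The helper `load_rules` and the tiny `SAMPLE_RULES` stand-in for the downloaded file are illustrative assumptions, and unlike the code above it raises when the host is itself a listed suffix:

```python
from urllib.parse import urlparse


def load_rules(lines):
    """Collect suffix rules, skipping blank lines and // comments."""
    stripped = (line.strip() for line in lines)
    return {rule for rule in stripped if rule and not rule.startswith("//")}


def get_domain(url, rules):
    """Return the registrable domain of url according to rules."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        wildcard = ".".join(["*"] + labels[start + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if start == 0:
                break  # the whole host is a public suffix
            return ".".join(labels[start - 1:])
    raise ValueError("no registrable domain found in %r" % url)


# Tiny in-memory stand-in for the real rules file (hypothetical sample).
SAMPLE_RULES = load_rules(["// comment", "", "uk", "co.uk", "*.ck", "!www.ck"])

print(get_domain("http://www.abcde.co.uk", SAMPLE_RULES))  # abcde.co.uk
```

With a real file, `load_rules(open(path, encoding="utf-8"))` would build the same kind of set. ValueError seems a reasonable choice here, since the argument has the right type but an unusable value.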
Answer 3:
Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote:
tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
Answer 4:
Using the Python tld package:
https://pypi.python.org/pypi/tld
$ pip install tld
from tld import get_tld
print get_tld("http://www.google.co.uk")
google.co.uk
or without protocol:
from tld import get_tld
get_tld("www.google.co.uk", fix_protocol=True)
google.co.uk
Answer 5:
Answer 6:
Here's how I handle it:
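import re
import sys
import urlparse

# url is assumed to already hold the address to check
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]
# keep the last two labels only, so www.foo.com.au becomes com.au
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)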
Answer 7:
Until get_tld is updated for all the new TLDs, I pull the domain out of the error message. Sure, it's bad code, but it works.
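# Method of a class that has a content_url attribute; the module is
# assumed to have run "import re" and "from tld import get_tld".
def get_tld(self):
    try:
        # inside the method body, get_tld is the imported function
        return get_tld(self.content_url)
    except Exception as e:
        # fall back to the domain quoted in the library's error message
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matchObj = re_domain.findall(str(e))
        if matchObj:
            return matchObj[0]
        raise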