I need a way to extract a domain name without the subdomain from a URL using Python's urlparse.
For example, I would like to extract "google.com" from a f
This is an update, based on the bounty request for an updated answer.
Start by using the tld package. A description of the package:
Extracts the top level domain (TLD) from the URL given. List of TLD names is taken from Mozilla http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat?raw=1
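from tld import get_tld
from tld.utils import update_tld_names

# Sync the local TLD list with the most recent version from Mozilla.
update_tld_names()

# Caveat (assumption about later tld releases): get_tld may return only the
# suffix (e.g. "co.uk") there, with get_fld returning the registered domain.
print(get_tld("http://www.google.co.uk"))
print(get_tld("http://zap.co.it"))
print(get_tld("http://google.com"))
print(get_tld("http://mail.google.com"))
print(get_tld("http://mail.google.co.uk"))
print(get_tld("http://google.co.uk"))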
This outputs:
google.co.uk
zap.co.it
google.com
google.com
google.co.uk
google.co.uk
Notice that it correctly handles country-level TLDs by keeping co.uk and co.it, while still removing the www and mail subdomains for both .com and .co.uk.
The update_tld_names() call at the beginning of the script is used to update/sync the tld names with the most recent version from Mozilla.
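To see why a suffix list is needed at all, here is a minimal standard-library sketch of the idea, not the tld package's actual implementation. It takes the hostname from urlparse, finds the longest known suffix, and keeps one label to its left. The names SAMPLE_SUFFIXES and registered_domain are hypothetical, and the suffix set is a tiny hand-picked subset chosen only to mirror the examples above; the real Mozilla list is far larger and has wildcard and exception rules that this ignores.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked subset of suffixes for illustration only.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "it", "co.it"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one label to its left, or None."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                # The host is itself a bare suffix; nothing is registered.
                return None
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("http://mail.google.co.uk"))
print(registered_domain("http://mail.google.com"))
```

Splitting on the last two labels alone would turn mail.google.co.uk into co.uk, which is why the lookup has to consult a list of known suffixes rather than count dots.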