Get Root Domain of Link


半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

7 answers

我在风中等你 (OP)

2021-01-17 09:04

General structure of a URL:

scheme://netloc/path;parameters?query#fragment
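To make the structure concrete, here is a small sketch that splits a URL into those six parts with the standard library. The URL is a made-up example chosen only because it contains every component.

```python
from urllib.parse import urlparse

# Hypothetical URL, picked so that all six components are non-empty.
parts = urlparse('http://www.example.com/a/b;v=1?q=python#top')

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /a/b
print(parts.params)    # v=1
print(parts.query)     # q=python
print(parts.fragment)  # top
```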

In the spirit of TIMTOWTDI (there's more than one way to do it), here are two approaches.

Using urlparse:

>>> from urllib.parse import urlparse  # Python 3.x
>>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
>>> domain = parsed_uri.netloc  # 'www.stackoverflow.com'
>>> result = domain[4:] if domain.startswith('www.') else domain  # strip only a leading 'www.'
>>> print(result)
stackoverflow.com
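If you want this as a reusable helper, something along these lines might do. The name strip_www is made up for illustration, and it assumes that removing a leading 'www.' is all you need; it does not know anything about other subdomains.

```python
from urllib.parse import urlparse

def strip_www(url):
    """Return the host of `url` without a leading 'www.' label (hypothetical helper)."""
    # .hostname lowercases the host and drops any port, unlike .netloc
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

print(strip_www('http://www.techcrunch.com/'))        # techcrunch.com
print(strip_www('http://WWW.TechCrunch.com:8080/x'))  # techcrunch.com
print(strip_www('http://awww.example.com/'))          # awww.example.com
```

Using startswith rather than replace avoids mangling hosts that merely contain the text 'www.' somewhere in the middle.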

Using tldextract:

>>> import tldextract  # looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

In your case:

>>> extracted = tldextract.extract('http://www.techcrunch.com/')
>>> '{}.{}'.format(extracted.domain, extracted.suffix)
'techcrunch.com'

Unlike the urlparse approach, tldextract knows what every gTLD (generic top-level domain) and ccTLD (country-code top-level domain) looks like, because it looks up the currently active ones in the Public Suffix List. Given a URL, it can therefore separate the subdomain from the domain, and the domain from the suffix, even for multi-part suffixes such as co.uk.
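To give a rough feel for what a suffix-list lookup does, here is a toy sketch using only the standard library. It is not how tldextract is implemented: SAMPLE_SUFFIXES is a tiny hand-picked set standing in for the real list, registered_domain is a made-up name, and the real list's wildcard and exception rules are ignored.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the Public Suffix List, which has thousands of entries.
SAMPLE_SUFFIXES = {'com', 'org', 'uk', 'co.uk'}

def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return 'domain.suffix' for `url`, or None if no known suffix matches."""
    labels = (urlparse(url).hostname or '').split('.')
    # Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            # Keep the one label immediately to the left of the suffix.
            return '.'.join(labels[i - 1:])
    return None

print(registered_domain('http://www.techcrunch.com/'))   # techcrunch.com
print(registered_domain('http://forums.news.cnn.com/'))  # cnn.com
print(registered_domain('http://forums.bbc.co.uk/'))     # bbc.co.uk
```

The last case is the one that plain string handling gets wrong: without a suffix list there is no way to tell that co.uk is a suffix while cnn.com is not.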

Cheerio! :)
