Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in Python?

7 Answers
  •  我在风中等你
    2021-01-17 09:04

    General structure of a URL:

    scheme://netloc/path;parameters?query#fragment

    In the spirit of TIMTOWTDI (there's more than one way to do it), here are two approaches:

    Using urlparse,

    >>> from urllib.parse import urlparse  # Python 3.x
    >>> parsed_uri = urlparse('http://www.stackoverflow.com/questions/41899120/whatever')  # returns six components
    >>> domain = '{uri.netloc}/'.format(uri=parsed_uri)
    >>> result = domain.replace('www.', '')  # as per your case; note this removes 'www.' wherever it occurs
    >>> print(result)
    stackoverflow.com/
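
    Since str.replace removes 'www.' anywhere in the host, a slightly safer variant strips it only when it is the leading label. This is a minimal sketch using only the standard library; the helper name strip_www is made up for illustration, and it still does not handle subdomains other than www.

```python
from urllib.parse import urlparse


def strip_www(url):
    """Return the host of `url` without a leading 'www.' label (hypothetical helper)."""
    # .hostname is lowercased and excludes any port; it is None when no host is present.
    host = urlparse(url).hostname or ''
    # Remove 'www.' only when it is the leading label, not wherever it occurs.
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host


print(strip_www('http://www.techcrunch.com/'))
print(strip_www('http://awww.example.com/'))
```

    With the replace-based version, a host such as awww.example.com would be mangled; here it is returned unchanged.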
    

    Using tldextract,

    >>> import tldextract  # The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers
    >>> tldextract.extract('http://forums.news.cnn.com/')
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
    

    In your case:

    >>> extracted = tldextract.extract('http://www.techcrunch.com/')
    >>> '{}.{}'.format(extracted.domain, extracted.suffix)
    'techcrunch.com'
    

    tldextract, on the other hand, knows what all gTLDs (generic top-level domains) and ccTLDs (country-code top-level domains) look like, because it looks up the currently active ones in the Public Suffix List. So, given a URL, it can tell the subdomain from the domain, and the domain from its suffix, including country-code suffixes.
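
    To make the suffix-list idea concrete, here is a rough sketch of a longest-suffix lookup. It is not how tldextract is implemented: SAMPLE_SUFFIXES is a tiny hand-picked set standing in for the real Public Suffix List, the function name registered_domain is made up for illustration, and the list's wildcard and exception rules are ignored.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample for illustration only; the real Public Suffix List
# has thousands of entries plus wildcard and exception rules not handled here.
SAMPLE_SUFFIXES = {'com', 'org', 'uk', 'co.uk'}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one more label, or '' if none applies."""
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    # Try the longest candidate suffix first so that 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            # i == 0 means the whole host is itself a suffix: no registrable domain.
            return '.'.join(labels[i - 1:]) if i > 0 else ''
    return ''


print(registered_domain('http://www.techcrunch.com/'))
print(registered_domain('http://forums.news.cnn.com/'))
print(registered_domain('http://www.bbc.co.uk/news'))
```

    The third call is the case that simple "take the last two labels" logic gets wrong: it would return co.uk, whereas the suffix lookup keeps bbc.co.uk.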

    Cheerio! :)
