Get Root Domain of Link

半阙折子戏 2021-01-17 08:13

I have a link such as http://www.techcrunch.com/ and I would like to extract just the techcrunch.com part of it. How do I do this in Python?

7 Answers
  •  长情又很酷
    2021-01-17 08:55

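    from urllib.parse import urlsplit

    def get_domain(url):
        # hostname (unlike netloc) leaves out any port or user:password@ prefix
        u = urlsplit(url)
        return u.hostname or ''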
    
    def get_top_domain(url):
        u"""
        >>> get_top_domain('http://www.google.com')
        'google.com'
        >>> get_top_domain('http://www.sina.com.cn')
        'sina.com.cn'
        >>> get_top_domain('http://bbc.co.uk')
        'bbc.co.uk'
        >>> get_top_domain('http://mail.cs.buaa.edu.cn')
        'buaa.edu.cn'
        """
        domain = get_domain(url)
        domain_parts = domain.split('.')
        if len(domain_parts) < 2:
            return domain
        top_domain_parts = 2
        # a two-letter last label is treated as a country-code TLD (ccTLD)
        if len(domain_parts[-1]) == 2:
            if domain_parts[-1] in ['uk', 'jp']:
                if domain_parts[-2] in ['co', 'ac', 'me', 'gov', 'org', 'net']:
                    top_domain_parts = 3
            else:
                if domain_parts[-2] in ['com', 'org', 'net', 'edu', 'gov']:
                    top_domain_parts = 3
        return '.'.join(domain_parts[-top_domain_parts:])
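
The hard-coded lists above only cover a handful of country suffixes. A more general variant of the same idea is to look for the longest known suffix and keep one label to its left. The following is a minimal sketch of that, not a complete solution: `SAMPLE_SUFFIXES` and `registrable_domain` are names made up for this illustration, and the suffix set is a tiny hand-picked sample. Real code would load the full Public Suffix List, for instance through a third-party package such as tldextract.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-picked sample of public suffixes (an assumption for
# this sketch); real code would load the full Public Suffix List instead.
SAMPLE_SUFFIXES = {"com", "org", "net", "uk", "co.uk", "cn", "com.cn", "edu.cn"}


def registrable_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest known suffix plus one label to its left."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep one extra label left of the suffix, if there is one.
            return ".".join(labels[max(i - 1, 0):])
    return host  # no known suffix: fall back to the whole host


print(registrable_domain("http://www.techcrunch.com/"))
```

For the URL in the question this prints techcrunch.com. Hosts whose suffix is missing from the sample set are returned unchanged, which is why the full list matters in practice.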
    
