Python urlparse — extract domain name without subdomain

南笙 2020-12-01 02:30

I need a way to extract a domain name without the subdomain from a URL using Python's urlparse.

For example, I would like to extract "google.com" from a f

7 answers
  •  慢半拍i (OP)
    2020-12-01 03:17

    This is not a standard decomposition of a URL, so urlparse alone will not give it to you.

    You cannot rely on a www. prefix being present, and where it is present you cannot assume it is optional. In many cases it is not.

    If you are willing to assume that only the last two components matter (which fails for .uk domains such as www.google.co.uk), you can take split('.')[-2:] of the hostname.
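
A minimal sketch of the last-two-components approach, mainly to show where it breaks. The function name last_two_labels is made up for illustration; only urlparse comes from the standard library.

```python
from urllib.parse import urlparse

def last_two_labels(url):
    """Naive approach: keep only the last two dot-separated labels of the host."""
    # .hostname is None when the URL has no network location, so fall back to "".
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

print(last_two_labels("http://www.google.com/search"))  # google.com
print(last_two_labels("http://www.google.co.uk/"))      # co.uk (not what you want)
```

Note that urlparse only finds a hostname when the URL includes a scheme or a leading //, so a bare "www.google.com" would yield an empty string here.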

    Alternatively, and this is actually less error-prone, strip a leading www. prefix.
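
Stripping the prefix could look roughly like this. The helper name strip_www is hypothetical, and the sketch deliberately leaves every other subdomain untouched.

```python
from urllib.parse import urlparse

def strip_www(url):
    """Return the hostname with a single leading 'www.' removed, if present."""
    host = urlparse(url).hostname or ""
    # Slice instead of str.removeprefix so this also runs on Python < 3.9.
    return host[4:] if host.startswith("www.") else host

print(strip_www("https://www.google.co.uk/maps"))  # google.co.uk
print(strip_www("https://mail.google.com/"))       # mail.google.com (unchanged)
```

This handles the co.uk case correctly, but it does nothing for subdomains other than www.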

    Either way, you cannot assume that the www. is optional, because dropping it will NOT work for every site!

    Here is a list of common domain suffixes. You can try to keep the suffix plus one component in front of it.

    https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
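
The "suffix plus one component" idea can be sketched as follows. SAMPLE_SUFFIXES is a tiny hand-written stand-in for the list linked above, and registered_domain is a made-up name. The sketch assumes plain suffix entries only and ignores the wildcard and exception rules that the real list contains.

```python
from urllib.parse import urlparse

# Hypothetical sample; a real program would load the full suffix list instead.
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}

def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return the longest matching suffix plus one label, or None if no suffix matches."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Try the longest candidate suffix first. Starting at 1 guarantees that
    # at least one label remains in front of the suffix.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[i - 1:])
    return None

print(registered_domain("http://www.google.co.uk/"))   # google.co.uk
print(registered_domain("https://mail.google.com/x"))  # google.com
print(registered_domain("http://localhost/"))          # None
```

Matching the longest suffix first is what makes co.uk win over uk; with a two-entry lookup in the other order, www.google.co.uk would come back as co.uk again.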

    But how do you plan to handle, for example, first.last.name domains? Would you assume that all users with the same last name belong to the same company? Initially, you could only get third-level domains under .name. By now, you can apparently get second-level domains too, so for .name there is no general rule.
