PHP has the parse_url() function that will help you do the basic splitting into protocol, host, port, and so on.
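As a minimal sketch of that basic splitting (the URL below is a made-up example, not one from the question), parse_url() returns an associative array whose keys include scheme, host and port:

```php
<?php
// Hypothetical example URL, chosen only for illustration.
$url = 'http://server1.ibm.co.uk:8080/path/page.php?id=1';

// parse_url() only splits the string; it does not validate the URL
// and knows nothing about TLDs.
$parts = parse_url($url);

echo $parts['scheme'], "\n"; // http
echo $parts['host'], "\n";   // server1.ibm.co.uk
echo $parts['port'], "\n";   // 8080
```

Note that the host comes back as a single string, so deciding which part of it is the "domain" is still up to you.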
Extracting the "right" domain in uncertain cases is much harder, because "two-part TLDs" are sometimes set up by the TLD authority itself (e.g. in the UK) and sometimes run by private enterprises (e.g. .uk.com). I think you won't get around maintaining a list of top-level domains that have two parts; those endings would then be treated like TLDs (top-level domains), swallowing the second part.
This is the only way of reliably telling apart hosts under a "two-part TLD", like server1.ibm.co.uk (where the two-part .co.uk needs to be removed to determine the domain itself), from regular sub-domains like server1.ibm.com (where only .com needs to be removed).
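The list-based approach could be sketched roughly as follows. The function name extract_domain() and the two-entry list are assumptions made for illustration only; a real list would need many more endings.

```php
<?php
/**
 * Return the domain part of a host name, treating every ending in
 * $twoPartTlds as if it were a single TLD.
 *
 * Hypothetical helper: the name and signature are illustrative only.
 */
function extract_domain(string $host, array $twoPartTlds): string
{
    $labels = explode('.', strtolower($host));
    $count  = count($labels);

    // Nothing to strip from "localhost" or "ibm.com".
    if ($count <= 2) {
        return implode('.', $labels);
    }

    // If the last two labels form a known two-part ending, keep three
    // labels (name + two-part TLD); otherwise keep two (name + TLD).
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $keep    = in_array($lastTwo, $twoPartTlds, true) ? 3 : 2;

    return implode('.', array_slice($labels, -$keep));
}

// Deliberately tiny list, using only the endings mentioned above.
$twoPartTlds = ['co.uk', 'uk.com'];

echo extract_domain('server1.ibm.co.uk', $twoPartTlds), "\n"; // ibm.co.uk
echo extract_domain('server1.ibm.com', $twoPartTlds), "\n";   // ibm.com
```

The result is only as good as the list: an ending that is missing from it is treated as an ordinary domain under a one-part TLD.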
A good starting point to get a list of many important "two-part TLDs" is the domain search at speednames.com (select "all" in countries). A more complete list can be found as part of the Ruby domainatrix library.