I am using C# and ASP.NET for this.
We receive a lot of "strange" requests on our IIS 6.0 servers, and I want to log and catalog these by domain.
Eg. we get
Use a regular expression:
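^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)$
This matches any URL that ends with one of the TLDs you are interested in; extend the alternation with as many as you need. The three capturing groups contain the subdomain (if any), the domain name and the TLD, respectively. Because the pattern is anchored with $, apply it to the scheme and host only, not to a URL that still carries a path or query string.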
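As a rough sketch of how a pattern of this kind could be applied in C#, the snippet below uses a variant with escaped dots and a single grouped alternation for the TLDs. The class and method names (TldRegexDemo, TrySplit) are made up for illustration, and the pattern only knows the three suffixes listed in it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TldRegexDemo
{
    // Illustrative pattern: dots are escaped and the TLD alternatives share one group.
    // Only three suffixes are listed here; extend the alternation as needed.
    private static readonly Regex HostPattern = new Regex(
        @"^https?://(?:([\w.-]+)\.)?([\w-]+)\.(com|co\.uk|com\.au)$",
        RegexOptions.IgnoreCase);

    // Returns true and fills the out parameters when the URL matches.
    // An absent subdomain is reported as an empty string.
    public static bool TrySplit(string url, out string subdomain, out string name, out string tld)
    {
        Match m = HostPattern.Match(url);
        subdomain = m.Success ? m.Groups[1].Value : null;
        name = m.Success ? m.Groups[2].Value : null;
        tld = m.Success ? m.Groups[3].Value : null;
        return m.Success;
    }
}
```

Calling TrySplit("http://www.example.co.uk", ...) should yield "www", "example" and "co.uk". A URL whose suffix is not in the alternation, such as "http://example.org", is simply not matched.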
You can use the Nager.PublicSuffix NuGet package. It uses the same data source that browser vendors use (the Public Suffix List).
NuGet
PM> Install-Package Nager.PublicSuffix
Example
var domainParser = new DomainParser(new WebTldRuleProvider());
var domainName = domainParser.Get("sub.test.co.uk");
//domainName.Domain = "test";
//domainName.Hostname = "sub.test.co.uk";
//domainName.RegistrableDomain = "test.co.uk";
//domainName.SubDomain = "sub";
//domainName.TLD = "co.uk";
I've written a library for use in .NET 2+ to help pick out the domain components of a URL.
More details are on GitHub, but one benefit over the previous options is that it can automatically download the latest data from http://publicsuffix.org (once per month), so its output should be more or less on a par with what web browsers use to establish domain security boundaries (i.e. pretty good).
It's not perfect yet, but it suits my needs and shouldn't take much work to adapt to other use cases, so please fork it and send a pull request if you want.
This is not possible without an up-to-date database of the different domain levels.
Consider:
s1.moh.gov.cn
moh.gov.cn
s1.google.com
google.com
At which level do you want to extract the domain? That depends entirely on the TLD, SLD, ccTLD and so on. Because ccTLDs are under the control of individual countries, they may define special SLDs that are unknown to you.
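To make the point concrete, here is a minimal sketch of a suffix lookup against a tiny, hand-picked suffix set. The set, the class name and the method name are all hypothetical; a real list such as the one at publicsuffix.org has thousands of entries plus wildcard and exception rules that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixLookupDemo
{
    // Hypothetical sample data, not a real suffix list.
    private static readonly HashSet<string> KnownSuffixes = new HashSet<string>(
        new[] { "com", "cn", "gov.cn", "uk", "co.uk" },
        StringComparer.OrdinalIgnoreCase);

    // Returns the longest known suffix plus one label to its left,
    // or null if no known suffix matches.
    public static string GetRegistrableDomain(string host)
    {
        string[] labels = host.Split('.');
        // Start at i = 1 so at least one label remains in front of the suffix.
        // Smaller i means a longer candidate, so the longest suffix wins.
        for (int i = 1; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (KnownSuffixes.Contains(candidate))
                return labels[i - 1] + "." + candidate;
        }
        return null;
    }
}
```

With this sample set, both "s1.moh.gov.cn" and "moh.gov.cn" map to "moh.gov.cn", while "s1.google.com" maps to "google.com". Remove "gov.cn" from the set and the first two would map to "gov.cn" instead, which is why the quality of the result depends on the data.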
uri.Host.ToLower().Replace("www.","").Substring(uri.Host.ToLower().Replace("www.","").IndexOf('.'))
returns ".com" for
Uri uri = new Uri("http://stackoverflow.com/questions/4643227/top-level-domain-from-url-in-c");
returns ".co.jp" for
Uri uri = new Uri("http://stackoverflow.co.jp");
returns ".s1.moh.gov.cn" for
Uri uri = new Uri("http://stackoverflow.s1.moh.gov.cn");
etc.
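The same idea can be written as a small helper, sketched below under a couple of assumptions of mine: it strips "www." only when it is a prefix (Replace removes it anywhere in the host), and it returns an empty string for hosts without a dot, where the one-liner's Substring call would throw. The names HostSuffixDemo and GetSuffixAfterFirstLabel are made up for this example.

```csharp
using System;

public static class HostSuffixDemo
{
    // Returns everything from the first dot onward, after dropping a leading "www." only.
    public static string GetSuffixAfterFirstLabel(Uri uri)
    {
        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www.", StringComparison.Ordinal))
            host = host.Substring(4);

        int firstDot = host.IndexOf('.');
        // Hosts without a dot (e.g. "localhost") have no suffix to return.
        return firstDot < 0 ? string.Empty : host.Substring(firstDot);
    }
}
```

It behaves like the expression above for the three sample URIs, so it shares the same limitation: everything after the first label is treated as the suffix.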
There may be cases where this returns something other than what you want, but country-code TLDs are the only two-character TLDs, and they may or may not be preceded by a short second-level label (typically 2 or 3 characters). This will therefore give you what you want in most cases:
string GetRootDomain(string host)
{
    string[] domains = host.Split('.');
    if (domains.Length >= 3)
    {
        int c = domains.Length;
        // handle international country code TLDs
        // www.amazon.co.uk => amazon.co.uk
        if (domains[c - 1].Length < 3 && domains[c - 2].Length <= 3)
            return string.Join(".", domains, c - 3, 3);
        else
            // otherwise keep the last two labels: www.google.com => google.com
            return string.Join(".", domains, c - 2, 2);
    }
    else
        // one or two labels: nothing to strip
        return host;
}
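To tie this back to cataloging requests by domain, the sketch below repeats the heuristic above and counts a list of request URLs per root domain. DomainCatalogDemo and CountByRootDomain are hypothetical names, and skipping IP-address hosts is my own assumption, since the label-length check would otherwise misread something like "192.168.1.1".

```csharp
using System;
using System.Collections.Generic;

public static class DomainCatalogDemo
{
    // Same heuristic as above, repeated so the sketch is self-contained.
    public static string GetRootDomain(string host)
    {
        string[] domains = host.Split('.');
        if (domains.Length < 3)
            return host;
        int c = domains.Length;
        if (domains[c - 1].Length < 3 && domains[c - 2].Length <= 3)
            return string.Join(".", domains, c - 3, 3);
        return string.Join(".", domains, c - 2, 2);
    }

    // Counts requests per root domain. Strings that are not absolute URLs
    // and hosts that are not DNS names (e.g. IP addresses) are skipped.
    public static Dictionary<string, int> CountByRootDomain(IEnumerable<string> urls)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string url in urls)
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
                continue;
            if (uri.HostNameType != UriHostNameType.Dns)
                continue;

            string root = GetRootDomain(uri.Host);
            int current;
            counts.TryGetValue(root, out current); // current stays 0 for a new key
            counts[root] = current + 1;
        }
        return counts;
    }
}
```

Feeding it "http://www.amazon.co.uk/a", "http://amazon.co.uk/b" and "http://s1.google.com/" should give a count of 2 for "amazon.co.uk" and 1 for "google.com". The counts are only as good as the heuristic, so hosts it misjudges end up under the wrong key.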