I'm looping over a series of URLs and want to clean them up. I have the following code:
require 'uri'

# Parse url to remove http, path and check format
o_url = URI.parse(url)
Why not just strip the .com or .co.uk and then split on '.' and get the last element?
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
Have to say it feels hacky. Are there any other domains like .co.uk?
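For what it's worth, here's that one-liner in action (the example hosts are made up, and I'm assuming some_url is a URI parsed from the full URL; note that sub sets $1 as a side effect):

require 'uri'

some_url = URI.parse('http://www.foo.co.uk/some/path')
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
# => "foo.co.uk"

some_url = URI.parse('http://bar.foo.com/')
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1
# => "foo.com"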
For posterity, here's an update from Oct 2014:
I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.
In combination with URI.parse for stripping protocol and paths, it works really well:
PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"
The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).
Ready for a complex regular expression? :)
re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip
Let's break this into two sections.

^(?:(?>[a-z0-9-]*\.)+?|)

collects the subdomains by matching groups of characters followed by a dot. The +? quantifier is lazy, but backtracking still pushes every subdomain into this section, because the second section can only match the final host and TLD. The empty alternation is needed for the case of no subdomain at all (such as foo.com).

([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$

collects the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD where the second part is two characters (like .oh.us or .org.uk).
I tested this expression on the following samples:
foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk
Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
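If you want to sanity-check the expression yourself, here's a quick throwaway harness over the samples above:

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

samples = {
  'foo.com'             => 'foo.com',
  'www.foo.com'         => 'foo.com',
  'bar.foo.com'         => 'foo.com',
  'www.foo.ca'          => 'foo.ca',
  'www.foo.co.uk'       => 'foo.co.uk',
  'a.b.c.d.e.foo.com'   => 'foo.com',
  'a.b.c.d.e.foo.co.uk' => 'foo.co.uk'
}

samples.each do |host, expected|
  actual = host.gsub(re, '\1')
  puts "#{host} => #{actual} #{actual == expected ? 'OK' : 'FAIL'}"
end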
I've wrestled with this a lot while writing various and sundry crawlers and scrapers over the years. My favorite gem for solving it is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl. It's available for Ruby, JavaScript and Elixir.
This is a tricky issue. Some top-level domains do not accept registrations at the second level. Compare example.com and example.co.uk: if you simply stripped everything except the last two labels, you would end up with example.com and co.uk, and the latter can never be the intention.
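To make that concrete, here's what naively taking the last two labels produces (a throwaway sketch, not a recommendation):

def last_two_labels(host)
  host.split('.').last(2).join('.')
end

last_two_labels('example.com')       # => "example.com"  (what you wanted)
last_two_labels('www.example.co.uk') # => "co.uk"        (the public suffix, not the domain)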
Firefox solves this by filtering on the effective top-level domain, and Mozilla maintains a list of all these domains. More information at publicsuffix.org.
You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
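If you did port it, the heart of the algorithm is longest-suffix matching against that list. A rough sketch, with a tiny hardcoded excerpt standing in for the real data file (a full port would also load the whole list and handle its wildcard and exception rules):

require 'set'

# Tiny hardcoded excerpt of the publicsuffix.org list; a real port would
# parse the full data file, including wildcard (*.ck) and exception
# (!www.ck) rules.
PUBLIC_SUFFIXES = Set.new(%w[com net org uk co.uk org.uk us oh.us])

# Returns the registrable domain: the label immediately to the left of
# the longest matching public suffix, plus that suffix.
def registrable_domain(host)
  host = host.downcase
  return nil if PUBLIC_SUFFIXES.include?(host) # the host is itself a public suffix
  labels = host.split('.')
  # i = 1 is the longest candidate suffix (everything but the first label);
  # each step shortens it by one label.
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if PUBLIC_SUFFIXES.include?(suffix)
  end
  nil
end

registrable_domain('www.example.co.uk') # => "example.co.uk"
registrable_domain('example.com')       # => "example.com"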
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"