I need to go through a large list of string url\'s and extract the domain name from them.
For example:
http://www.stackoverflow.com/questions
I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough but it should be good enough in most cases!
I read here that the java.net.URI class could do this (and was preferred to the java.net.URL class) but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the url does not include the scheme, i.e. the "http(s)" bit.
/**
* Extracts the domain name from {@code url}
* by means of String manipulation
* rather than using the {@link URI} or {@link URL} class.
*
* @param url is non-null.
* @return the domain name within {@code url}.
*/
public String getUrlDomainName(String url) {
String domainName = new String(url);
int index = domainName.indexOf("://");
if (index != -1) {
// keep everything after the "://"
domainName = domainName.substring(index + 3);
}
index = domainName.indexOf('/');
if (index != -1) {
// keep everything before the '/'
domainName = domainName.substring(0, index);
}
// check for and remove a preceding 'www'
// followed by any sequence of characters (non-greedy)
// followed by a '.'
// from the beginning of the string
domainName = domainName.replaceFirst("^www.*?\\.", "");
return domainName;
}