I am wondering if there is a parser or library in Java for extracting the second-level domain (SLD) from a URL, or, failing that, an algorithm or regex for doing the same. For exam
I don't know your purpose, but the second-level domain as such may not mean much to you. You probably need to find the public suffix; the domain directly below it is what you are looking for.
Apache HttpComponents (HttpClient 4) comes with classes to handle this:
org.apache.http.impl.cookie.PublicSuffixFilter
org.apache.http.impl.cookie.PublicSuffixListParser
You need to download the public suffix list from here:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
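To illustrate what "the public suffix and the domain right below it" means, here is a minimal sketch that uses only the JDK. It is not the HttpClient implementation: the class name PublicSuffixSketch, the method registrableDomain and the handful of SAMPLE_RULES are made up for this example, and the rules are a tiny stand-in for the downloaded list. Exception rules (lines starting with "!") and the list's default rule for unknown TLDs are deliberately left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Minimal sketch of public-suffix matching (hypothetical helper, JDK only).
class PublicSuffixSketch {
    // Assumption: a few rules written in the list's format; "*." marks a
    // wildcard rule. A real program would load every rule from the file.
    static final Set<String> SAMPLE_RULES = new HashSet<>(Arrays.asList(
            "com", "uk", "co.uk", "jp", "or.jp", "*.ck"));

    // Returns the public suffix plus the one label directly below it, or
    // null if the host is itself a public suffix or matches no sample rule.
    static String registrableDomain(String host) {
        List<String> labels = Arrays.asList(host.toLowerCase(Locale.ROOT).split("\\."));
        // Try the longest candidate suffix first so the longest rule wins.
        for (int i = 0; i < labels.size(); i++) {
            String candidate = String.join(".", labels.subList(i, labels.size()));
            String wildcard = i + 1 < labels.size()
                    ? "*." + String.join(".", labels.subList(i + 1, labels.size()))
                    : null;
            if (SAMPLE_RULES.contains(candidate) || SAMPLE_RULES.contains(wildcard)) {
                // labels[i..] is the public suffix; prepend the label before it.
                return i == 0 ? null : String.join(".", labels.subList(i - 1, labels.size()));
            }
        }
        return null;
    }
}
```

With these sample rules, registrableDomain("www.bbc.co.uk") yields bbc.co.uk, while a plain "take the last two labels" approach would wrongly give co.uk.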
To give you a feeling, here is what is left of the Alexa top 40,000 when the entries with a real TLD are filtered out (which means Alexa does NOT combine the rating under one domain for the following):
bp.blogspot.com---espn.go.com---files.wordpress.com---abcnews.go.com---disney.go.com---troktiko.blogspot.com---en.wordpress.com---api.ning.com---abc.go.com---220.181.38.82---213.174.154.20---abclocal.go.com---feedproxy.google.com/~r---forums.wordpress.com---googleblog.blogspot.com---1.cnm999.com/user/10008---213.174.143.196---92.42.51.201---googlewebmastercentral.blogspot.com---myespn.go.com---213.174.143.197---61.132.221.146---support.wordpress.com---dashboard.wordpress.com---sethgodin.typepad.com---paygo.17zhifu.com/user/10005---go2.wordpress.com---1.1.1.1---movies.go.com---home.comcast.net---googlesystem.blogspot.com---abcfamily.go.com---home.spaces.live.com---196.1.237.210---kaixin001.com/~record---xhamster.com/user/video---gold-oil-commodity.blogspot.com---journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2---206.108.48.238---blog.wordpress.com---67.220.92.21---183.101.80.130---211.94.190.80---youtube-global.blogspot.com---uta-net.com/user/phplib---cinema3satu.blogspot.com---119.147.41.16---sites.google.com/site/sites---kk.iij4u.or.jp/~dyo---220.181.6.19---toontown.go.com---signup.wordpress.com---thesartorialist.blogspot.com---analytics.blogspot.com---ss.iij4u.or.jp/~ceh2---67.220.92.23---gmailblog.blogspot.com---183.99.121.86---vgorode.ru/user/create---61.132.216.243---217.175.53.72---labnol.blogspot.com---adsense.blogspot.com---subscribe.wordpress.com---fimotro.blogspot.com---creators.ning.com---sarkari-naukri.blogspot.com---search.wordpress.com---orange-hiyoko.blogspot.com---cashewmaniakpop.wordpress.com---pixiehollow.go.com---adwords.blogspot.com---202.53.226.102---lorelle.wordpress.com---homestead.com/~site---multiply.com/user/signout---221.231.148.249---183.101.80.77---windowsliveintro.spaces.live.com---124.228.254.234---streaming-web.blogspot.com---id.tianya.cn/user/message---familyfun.go.com---tro-ma-ktiko.blogspot.com---about.ning.com---paygo.17zhifu.com/user/10020---tututina.blogspot.com---toolserver.org/~geohack---superjob.ru/user/resume---ejobs.ro/use
r/locuri-de-munca---gnula.blogspot.com---alles.or.jp/~uir---chiark.greenend.org.uk/~sgtatham---woork.blogspot.com---88.208.32.218---webstreamingmania.blogspot.com---spaces.live.com---youtube.com/user/RayWilliamJohnson---cloob.com/user/login---asstr.org/~Kristen---getclicky.com/user/login---guesshermuff.blogspot.com---211.98.70.195---222.73.105.196---pp.iij4u.or.jp/~taakii---unsoloclic.blogspot.com---photoshopdisasters.blogspot.com---218.83.161.253---217.16.18.163---217.16.18.207---217.16.28.104---222.73.105.210---youtube.com/user/OldSpice---hubpages.com/user/new---pelisdvdripdd.blogspot.com---95.143.193.60---es.wordpress.com---217.16.18.206---61.147.116.146---damncoolpics.blogspot.com---family.go.com---81.176.235.162---gutteruncensorednewsr.blogspot.com---terselubung.blogspot.com---faisalardhy.blogspot.com---67.220.92.14---goodreads.com/user/show---116.228.55.34---profile.typepad.com---kaixin001.com/~truth---linkbuildersassociated.ning.com---nicotto.jp/user/mypage---ritemail.blogspot.com---hyperboleandahalf.blogspot.com---carscoop.blogspot.com---tubemogul.com/user/dash---press-gr.blogspot.com---81.176.235.164---soapnet.go.com---208.98.30.69---trelokouneli.blogspot.com---help.ning.com---id.tianya.cn/user/register---slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8---printable-coupons.blogspot.com---unic77.blogspot.com---globaleconomicanalysis.blogspot.com---183.101.80.68---221.194.33.60---doujin-games88.blogspot.com---magaseek.com/user/SearchProducts---files.posterous.com---wwwnew.splinder.com---kolom-tutorial.blogspot.com---strobist.blogspot.com---67.21.91.73---needanarticle.com/user/activity---forum.moe.gov.om/~moeoman---milasdaydreams.blogspot.com---88.208.17.189---67.220.92.22---115.238.100.211---nonews-news.blogspot.com---testosterona.blog.br---nn.iij4u.or.jp/~has---cs.tut.fi/~jkorpela---youtube.com/user/oldspice---67.159.53.25---taxalia.blogspot.com---208.98.30.70---filmesporno.blog.br---alles-schallundrauch.blogspot.com---vatera.hu/user/account---78.140.136.18
2---us.my.alibaba.com/user/join---stores.homestead.com---pes2008editing.blogspot.com---ocn.ne.jp/~matrix---adweek.blogs.com---115.238.55.94---markjaquith.wordpress.com---k3.dion.ne.jp/~dreamlov---38.99.186.222---film.tv.it---android-developers.blogspot.com---217.218.110.147---kadokado.com/user/login---bollyvideolinks4u.blogspot.com---sookyeong.wordpress.com---87.101.230.11---livecodes.blogspot.com---67.220.91.19---homepage2.nifty.com/bustered---pp.iij4u.or.jp/~manga100---110.173.49.202---erogamescape.dyndns.org/~ap2---cs.berkeley.edu/~lorch---cakewrecks.blogspot.com---59.106.117.185---119.75.213.61---id.wordpress.com---de.wordpress.com---telefilmdblink.blogspot.com---61.139.105.138---multiply.com/user/join---programseo.blogspot.com---collectivebias.ning.com---bablorub.blogspot.com---thinkexist.com/user/personalAccount---us.my.alibaba.com/user/sign---66.70.56.90---getsarkari-naukri.blogspot.com---59.106.117.183---productreviewplace.ning.com---support.weebly.com---kaixin001.com/~lucky---football-russia.blogspot.com---magaseek.com/user/ItemDetail---polprav.blogspot.com---atlasshrugs2000.typepad.com---jpn-manga.blogspot.com---88.208.32.219---google-latlong.blogspot.com---59.106.117.188---erogamescape.ddo.jp/~ap2---218.87.32.245---watchhorrormovies.blogspot.com---sarotiko.blogspot.com---googlewebmastercentral-de.blogspot.com---colmeia.blog.br---us.my.alibaba.com/user/webatm---220.170.79.109---darkville.blogspot.com---youtube.com/user/PiMPDailyDose---disneymovierewards.go.com---fukuoka.lg.jp---61.147.115.16---iisc.ernet.in---youtube.com/user/HuskyStarcraft---202.108.212.211---homepage3.nifty.com/otakarando---94.77.215.37---pitchit.ning.com---59.106.117.186---thestar.blogs.com---1.254.254.254---piratesonline.go.com---animedblink.blogspot.com---137.32.44.152---eurus.dti.ne.jp/~yoneyama---state.la.us---lastminute.is.it---bangpai.taobao.com/user/groups---csse.monash.edu.au/~jwb---jquery-howto.blogspot.com---sakura.ne.jp/~moesino---users.skynet.be/mgueury---saitama.lg.jp---por
taldasfinancas.gov.pt---bnonline.fi.cr---135.125.60.11---zhuhai.gd.cn---kuna.net.kw---59.175.213.77---58.218.199.7---multiply.com/user/signin---youtube.com/user/HDstarcraft---blinklist.com/user/join---us.my.alibaba.com/user/company---jptwitterhelp.blogspot.com---67.220.92.017---88.208.17.51---youtube.com/user/GoogleWebmasterHelp---208.53.156.229---filmdblink.blogspot.com---blinklist.com/user/signup---3arbtop.blogspot.com---attivissimo.blogspot.com---onlinemovie12.blogspot.com---98.126.189.86---mytvsource.blogspot.com---blinklist.com/user/login---googlejapan.blogspot.com---76.73.65.166---gutteruncensorednewsb.blogspot.com---issuu.com/user/upload---86.51.174.18---88.208.17.120---profile.china.alibaba.com/user/admin---jntuworldportal.blogspot.com---sz.js.cn---disneymovieclub.go.com---a1.com.mk---dd.iij4u.or.jp/~madonna---rr.iij4u.or.jp/~plasma---mlmlaunchformula.ning.com---112.78.7.151---blogdelatele.blogspot.com---googlemobile.blogspot.com---78.109.199.240---wsu.edu/~brians---internapoli-city.blogspot.com---hh.iij4u.or.jp/~dmt---kaixin001.com/~house---61.155.11.14---youtube.com/user/SHAYTARDS---turbobit.net/user/files---qjy168.com/user/do---hubpages.com/user/finished---upload2.dyndns.org---f32.aaa.livedoor.jp/~azusa---naruto-spoilers.blogspot.com---205.209.140.195---193.227.20.21---adsenseforfeeds.blogspot.com---group.ameba.jp/user/groups---
1. The method nonePublicDomainParts from simbo1905's contribution should be corrected, because some TLDs contain a ".", for example "com.ac":
input: "com.abc.com.ac"
output: "abc"
correct output: "com.abc"
To get the SLD, you can cut the TLD off a given domain using the method publicSuffix().
2. A Set should not be used, because a domain can contain the same part more than once, for example:
input: part1.part2.part1.TLD
output: part1, part2
correct output: part1, part2, part1 (or, in dotted form, part1.part2.part1)
So use List<String> instead of Set<String>.
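Both points can be seen in a small JDK-only sketch. The class PrivatePartsSketch and its two methods are hypothetical names for this example; the two list arguments stand in for what InternetDomainName.parts() and publicSuffix().parts() would return, so the snippet runs without Guava. It assumes the suffix parts really are the tail of the full name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper comparing the Set-based and List-based approaches.
class PrivatePartsSketch {
    // Set-based approach: removeAll drops every label that also occurs in
    // the suffix, and repeated labels collapse, so information is lost.
    static Set<String> withSet(List<String> allParts, List<String> suffixParts) {
        Set<String> result = new HashSet<>(allParts);
        result.removeAll(suffixParts);
        return result;
    }

    // List-based approach: the public suffix is the tail of the name, so
    // cut that many labels off the end; order and duplicates are kept.
    static List<String> withList(List<String> allParts, List<String> suffixParts) {
        return new ArrayList<>(allParts.subList(0, allParts.size() - suffixParts.size()));
    }
}
```

For the parts of "com.abc.com.ac" with the suffix parts "com" and "ac", withSet returns only abc, whereas withList returns com, abc.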
After looking at these answers and not being satisfied with them, I used the class com.google.common.net.InternetDomainName to subtract the public parts of a domain name from all of its parts:
import com.google.common.net.InternetDomainName;
import java.util.HashSet;
import java.util.Set;

// Caveat: a Set loses label order and repeated labels
// (for example, "com.abc.com.ac" yields only "abc").
Set<String> nonePublicDomainParts(String uriHost) {
    InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
    InternetDomainName publicDomainName = fullDomainName.publicSuffix();
    Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
    nonePublicParts.removeAll(publicDomainName.parts());
    return nonePublicParts;
}
That class is available from Maven in the Guava library:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>10.0.1</version>
    <scope>compile</scope>
</dependency>
Internally, this class uses TldPatterns, a package-private class that has the list of top-level domains baked into it.
Interestingly, if you look at that class's source at the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk would be emailing you asking for your credit card details in relation to their ongoing investigations into card fraud ;)
http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769
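import com.google.common.net.InternetDomainName;

// Returns the top private domain: the public suffix plus the label below it.
// Despite the parameter name, pass a host name here, not a full URI.
public static String getTopLevelDomain(String uri) {
    InternetDomainName fullDomainName = InternetDomainName.from(uri);
    InternetDomainName privateDomainName = fullDomainName.topPrivateDomain();
    // Rejoin the parts with dots.
    StringBuilder topDomain = new StringBuilder();
    for (String part : privateDomainName.parts()) {
        if (topDomain.length() > 0) {
            topDomain.append('.');
        }
        topDomain.append(part);
    }
    return topDomain.toString();
}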
Just pass in the domain and you will get back its top private domain (the name directly below the public suffix, not the bare TLD). Download the JAR file from http://code.google.com/p/guava-libraries/
After reading everything here, the correct solution should be (with Guava):
InternetDomainName.from(uriHost).topPrivateDomain().toString();
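Note that InternetDomainName.from expects a host name such as www.example.co.uk rather than a complete URL, so when starting from a URL the host has to be extracted first. A minimal sketch of that step using only the JDK follows; the class HostSketch and the method hostOf are hypothetical names for this example.

```java
import java.net.URI;

// Hypothetical helper: reduces an absolute URL to its host name.
class HostSketch {
    // Returns the host portion of the URL, or null if the input has no
    // authority part (for example a bare "www.example.com" without scheme).
    // URI.create throws IllegalArgumentException for malformed input.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }
}
```

The result of hostOf could then be handed to InternetDomainName.from. Hosts that are IP addresses, like several entries in the Alexa list above, have no public suffix and would need to be handled separately.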