Get the second-level domain of a URL (Java)

长发绾君心 2020-12-06 01:04

I am wondering if there is a parser or library in Java for extracting the second-level domain (SLD) from a URL, or failing that, an algorithm or regex for doing the same. For exam

10 answers
  • 2020-12-06 01:05

    I don't know your purpose, but the second-level domain as such may not mean much to you. You probably need to find the public suffix; the domain directly below it is what you are looking for.

    Apache HttpComponents (HttpClient 4) comes with classes to handle this:

    org.apache.http.impl.cookie.PublicSuffixFilter
    org.apache.http.impl.cookie.PublicSuffixListParser
    

    You need to download the public suffix list from here:

    http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
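
    The lookup described above (find the public suffix, then take the label directly below it) can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class name, the method name and the five-entry SUFFIXES set are hypothetical stand-ins for the downloaded list, and the wildcard and exception rules of the real list are ignored.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RegistrableDomainSketch {
    // Hypothetical, tiny stand-in for the real public suffix list.
    static final Set<String> SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk", "ac", "com.ac"));

    /** Returns the public suffix plus the one label directly below it. */
    static String registrableDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try the longest candidate first, so "co.uk" wins over "uk".
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(candidate)) {
                if (i == 0) {
                    throw new IllegalArgumentException(host + " is itself a public suffix");
                }
                return labels[i - 1] + "." + candidate;
            }
        }
        throw new IllegalArgumentException("no known public suffix for " + host);
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("http://www.example.co.uk/path")); // example.co.uk
        System.out.println(registrableDomain("https://blog.example.com/"));     // example.com
    }
}
```

    Checking candidates from longest to shortest is what lets multi-label suffixes such as co.uk take precedence over their last label.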

  • 2020-12-06 01:07
    1. The list mentioned above, plus keeping up with the Wikipedia updates, gives a roughly 98% correct TLD list.
    2. Going through http://www.iana.org/domains/root/db/ yourself, clicking each NIC and reading its latest news, gives you the other 2% (like .com.aq and .gov.an).
    3. Unfortunately, large "free webspace" providers are another thing to take into account, e.g. the countless *.blogspot.com domains. If you download the Alexa top 100,000 (a free CSV file), you can at least get a good overview of the most used of these, which should cover a certain percentage of such domains, e.g. when comparing Alexa ratings with StumbleUpon pageviews and Delicious bookmarks. (Alexa sometimes only counts the top domain, while Delicious really MD5-hashes every URL, so one Alexa entry maps to multiple Delicious MD5 hashes.)
    4. Apart from that, sometimes (as in the case of Twitter) what comes after the / is also important if you need uniqueness in order to rate something.

    Here is a list from the Alexa top 40,000 with the real TLDs filtered out, to give you a feeling for it (meaning Alexa does NOT combine the ratings under the parent domain for the following):

    bp.blogspot.com---espn.go.com---files.wordpress.com---abcnews.go.com---disney.go.com---troktiko.blogspot.com---en.wordpress.com---api.ning.com---abc.go.com---220.181.38.82---213.174.154.20---abclocal.go.com---feedproxy.google.com/~r---forums.wordpress.com---googleblog.blogspot.com---1.cnm999.com/user/10008---213.174.143.196---92.42.51.201---googlewebmastercentral.blogspot.com---myespn.go.com---213.174.143.197---61.132.221.146---support.wordpress.com---dashboard.wordpress.com---sethgodin.typepad.com---paygo.17zhifu.com/user/10005---go2.wordpress.com---1.1.1.1---movies.go.com---home.comcast.net---googlesystem.blogspot.com---abcfamily.go.com---home.spaces.live.com---196.1.237.210---kaixin001.com/~record---xhamster.com/user/video---gold-oil-commodity.blogspot.com---journeyplanner.tfl.gov.uk/user/XSLT_TRIP_REQUEST2---206.108.48.238---blog.wordpress.com---67.220.92.21---183.101.80.130---211.94.190.80---youtube-global.blogspot.com---uta-net.com/user/phplib---cinema3satu.blogspot.com---119.147.41.16---sites.google.com/site/sites---kk.iij4u.or.jp/~dyo---220.181.6.19---toontown.go.com---signup.wordpress.com---thesartorialist.blogspot.com---analytics.blogspot.com---ss.iij4u.or.jp/~ceh2---67.220.92.23---gmailblog.blogspot.com---183.99.121.86---vgorode.ru/user/create---61.132.216.243---217.175.53.72---labnol.blogspot.com---adsense.blogspot.com---subscribe.wordpress.com---fimotro.blogspot.com---creators.ning.com---sarkari-naukri.blogspot.com---search.wordpress.com---orange-hiyoko.blogspot.com---cashewmaniakpop.wordpress.com---pixiehollow.go.com---adwords.blogspot.com---202.53.226.102---lorelle.wordpress.com---homestead.com/~site---multiply.com/user/signout---221.231.148.249---183.101.80.77---windowsliveintro.spaces.live.com---124.228.254.234---streaming-web.blogspot.com---id.tianya.cn/user/message---familyfun.go.com---tro-ma-ktiko.blogspot.com---about.ning.com---paygo.17zhifu.com/user/10020---tututina.blogspot.com---toolserver.org/~geohack---superjob.ru/user/resume---
    ejobs.ro/user/locuri-de-munca---gnula.blogspot.com---alles.or.jp/~uir---chiark.greenend.org.uk/~sgtatham---woork.blogspot.com---88.208.32.218---webstreamingmania.blogspot.com---spaces.live.com---youtube.com/user/RayWilliamJohnson---cloob.com/user/login---asstr.org/~Kristen---getclicky.com/user/login---guesshermuff.blogspot.com---211.98.70.195---222.73.105.196---pp.iij4u.or.jp/~taakii---unsoloclic.blogspot.com---photoshopdisasters.blogspot.com---218.83.161.253---217.16.18.163---217.16.18.207---217.16.28.104---222.73.105.210---youtube.com/user/OldSpice---hubpages.com/user/new---pelisdvdripdd.blogspot.com---95.143.193.60---es.wordpress.com---217.16.18.206---61.147.116.146---damncoolpics.blogspot.com---family.go.com---81.176.235.162---gutteruncensorednewsr.blogspot.com---terselubung.blogspot.com---faisalardhy.blogspot.com---67.220.92.14---goodreads.com/user/show---116.228.55.34---profile.typepad.com---kaixin001.com/~truth---linkbuildersassociated.ning.com---nicotto.jp/user/mypage---ritemail.blogspot.com---hyperboleandahalf.blogspot.com---carscoop.blogspot.com---tubemogul.com/user/dash---press-gr.blogspot.com---81.176.235.164---soapnet.go.com---208.98.30.69---trelokouneli.blogspot.com---help.ning.com---id.tianya.cn/user/register---slovari.yandex.ru/~%D0%BA%D0%BD%D0%B8%D0%B3%D0%B8---printable-coupons.blogspot.com---unic77.blogspot.com---globaleconomicanalysis.blogspot.com---183.101.80.68---221.194.33.60---doujin-games88.blogspot.com---magaseek.com/user/SearchProducts---files.posterous.com---wwwnew.splinder.com---kolom-tutorial.blogspot.com---strobist.blogspot.com---67.21.91.73---needanarticle.com/user/activity---forum.moe.gov.om/~moeoman---milasdaydreams.blogspot.com---88.208.17.189---67.220.92.22---115.238.100.211---nonews-news.blogspot.com---testosterona.blog.br---nn.iij4u.or.jp/~has---cs.tut.fi/~jkorpela---youtube.com/user/oldspice---67.159.53.25---taxalia.blogspot.com---208.98.30.70---filmesporno.blog.br---alles-schallundrauch.blogspot.com---vatera.hu/user/account---
    78.140.136.182---us.my.alibaba.com/user/join---stores.homestead.com---pes2008editing.blogspot.com---ocn.ne.jp/~matrix---adweek.blogs.com---115.238.55.94---markjaquith.wordpress.com---k3.dion.ne.jp/~dreamlov---38.99.186.222---film.tv.it---android-developers.blogspot.com---217.218.110.147---kadokado.com/user/login---bollyvideolinks4u.blogspot.com---sookyeong.wordpress.com---87.101.230.11---livecodes.blogspot.com---67.220.91.19---homepage2.nifty.com/bustered---pp.iij4u.or.jp/~manga100---110.173.49.202---erogamescape.dyndns.org/~ap2---cs.berkeley.edu/~lorch---cakewrecks.blogspot.com---59.106.117.185---119.75.213.61---id.wordpress.com---de.wordpress.com---telefilmdblink.blogspot.com---61.139.105.138---multiply.com/user/join---programseo.blogspot.com---collectivebias.ning.com---bablorub.blogspot.com---thinkexist.com/user/personalAccount---us.my.alibaba.com/user/sign---66.70.56.90---getsarkari-naukri.blogspot.com---59.106.117.183---productreviewplace.ning.com---support.weebly.com---kaixin001.com/~lucky---football-russia.blogspot.com---magaseek.com/user/ItemDetail---polprav.blogspot.com---atlasshrugs2000.typepad.com---jpn-manga.blogspot.com---88.208.32.219---google-latlong.blogspot.com---59.106.117.188---erogamescape.ddo.jp/~ap2---218.87.32.245---watchhorrormovies.blogspot.com---sarotiko.blogspot.com---googlewebmastercentral-de.blogspot.com---colmeia.blog.br---us.my.alibaba.com/user/webatm---220.170.79.109---darkville.blogspot.com---youtube.com/user/PiMPDailyDose---disneymovierewards.go.com---fukuoka.lg.jp---61.147.115.16---iisc.ernet.in---youtube.com/user/HuskyStarcraft---202.108.212.211---homepage3.nifty.com/otakarando---94.77.215.37---pitchit.ning.com---59.106.117.186---thestar.blogs.com---1.254.254.254---piratesonline.go.com---animedblink.blogspot.com---137.32.44.152---eurus.dti.ne.jp/~yoneyama---state.la.us---lastminute.is.it---bangpai.taobao.com/user/groups---csse.monash.edu.au/~jwb---jquery-howto.blogspot.com---sakura.ne.jp/~moesino---users.skynet.be/mgueury---
    saitama.lg.jp---portaldasfinancas.gov.pt---bnonline.fi.cr---135.125.60.11---zhuhai.gd.cn---kuna.net.kw---59.175.213.77---58.218.199.7---multiply.com/user/signin---youtube.com/user/HDstarcraft---blinklist.com/user/join---us.my.alibaba.com/user/company---jptwitterhelp.blogspot.com---67.220.92.017---88.208.17.51---youtube.com/user/GoogleWebmasterHelp---208.53.156.229---filmdblink.blogspot.com---blinklist.com/user/signup---3arbtop.blogspot.com---attivissimo.blogspot.com---onlinemovie12.blogspot.com---98.126.189.86---mytvsource.blogspot.com---blinklist.com/user/login---googlejapan.blogspot.com---76.73.65.166---gutteruncensorednewsb.blogspot.com---issuu.com/user/upload---86.51.174.18---88.208.17.120---profile.china.alibaba.com/user/admin---jntuworldportal.blogspot.com---sz.js.cn---disneymovieclub.go.com---a1.com.mk---dd.iij4u.or.jp/~madonna---rr.iij4u.or.jp/~plasma---mlmlaunchformula.ning.com---112.78.7.151---blogdelatele.blogspot.com---googlemobile.blogspot.com---78.109.199.240---wsu.edu/~brians---internapoli-city.blogspot.com---hh.iij4u.or.jp/~dmt---kaixin001.com/~house---61.155.11.14---youtube.com/user/SHAYTARDS---turbobit.net/user/files---qjy168.com/user/do---hubpages.com/user/finished---upload2.dyndns.org---f32.aaa.livedoor.jp/~azusa---naruto-spoilers.blogspot.com---205.209.140.195---193.227.20.21---adsenseforfeeds.blogspot.com---group.ameba.jp/user/groups---

  • 2020-12-06 01:08

    1.

    The method nonePublicDomainParts from simbo1905's contribution should be corrected to handle TLDs (public suffixes) that contain a ".", for example "com.ac":

    input: "com.abc.com.ac"

    output: "abc"

    The correct output is "com.abc".

    To get the SLD, you can cut the TLD off the given domain using the method publicSuffix().

    2.

    A Set should not be used, because a domain can contain the same part more than once, for example:

    input: part1.part2.part1.TLD

    output: part1, part2

    The correct output is part1, part2, part1, or in the form part1.part2.part1.

    So instead of Set<String> use List<String>.
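
    Both corrections can be illustrated with a JDK-only sketch. It is not the Guava-based method itself: the class and method names are hypothetical, and the public suffix is passed in as a plain string here, whereas with Guava it would presumably come from publicSuffix(). Note that calling removeAll on a List would still delete every "com" label, so the sketch trims the suffix labels from the end instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NonPublicPartsSketch {
    /**
     * Returns the labels of host that lie to the left of publicSuffix,
     * keeping their order and any repeated labels.
     */
    static List<String> nonPublicParts(String host, String publicSuffix) {
        List<String> hostParts = Arrays.asList(host.split("\\."));
        List<String> suffixParts = Arrays.asList(publicSuffix.split("\\."));
        int keep = hostParts.size() - suffixParts.size();
        // The suffix must match the trailing labels of the host exactly.
        if (keep < 0 || !hostParts.subList(keep, hostParts.size()).equals(suffixParts)) {
            throw new IllegalArgumentException(host + " does not end with " + publicSuffix);
        }
        return new ArrayList<>(hostParts.subList(0, keep));
    }

    public static void main(String[] args) {
        System.out.println(nonPublicParts("com.abc.com.ac", "com.ac"));     // [com, abc]
        System.out.println(nonPublicParts("part1.part2.part1.tld", "tld")); // [part1, part2, part1]
    }
}
```

    Joining the returned list with "." gives the com.abc and part1.part2.part1 forms mentioned above.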

  • 2020-12-06 01:13

    After looking at these answers and not being satisfied by them, I used the class com.google.common.net.InternetDomainName to subtract the public parts of a domain name from all of its parts:

    Set<String> nonePublicDomainParts(String uriHost) {
        InternetDomainName fullDomainName = InternetDomainName.from(uriHost);
        InternetDomainName publicDomainName = fullDomainName.publicSuffix();
        Set<String> nonePublicParts = new HashSet<String>(fullDomainName.parts());
        nonePublicParts.removeAll(publicDomainName.parts());
        return nonePublicParts;
    }
    

    That class is available from Maven in the Guava library:

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>10.0.1</version>
            <scope>compile</scope>
        </dependency>
    

    Internally this class uses TldPatterns.class, which is package-private and has the list of top-level domains baked into it.

    Interestingly, if you look at that class's source at the link below, it explicitly lists "police.uk" as a private domain name. This is correct, as police.uk is a private domain controlled by the police; otherwise criminals.police.uk would be emailing you asking for your credit card details in relation to their ongoing investigations into card fraud ;)

    http://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/net/TldPatterns.java?spec=svn8c3cc7e67132f8dcaae4bd214736a8ddf6611769&r=8c3cc7e67132f8dcaae4bd214736a8ddf6611769

  • 2020-12-06 01:13
    import com.google.common.net.InternetDomainName;
    import java.util.Iterator;

    public static String getTopLevelDomain(String uri) {
        InternetDomainName fullDomainName = InternetDomainName.from(uri);
        InternetDomainName publicDomainName = fullDomainName.topPrivateDomain();
        String topDomain = "";

        // Join the parts of the top private domain with dots.
        Iterator<String> it = publicDomainName.parts().iterator();
        while (it.hasNext()) {
            String part = it.next();
            if (!topDomain.isEmpty()) {
                topDomain += ".";
            }
            topDomain += part;
        }
        return topDomain;
    }
    

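    Just pass in the domain and you will get the top private domain, i.e. the public suffix plus one more label (despite the method's name, not the bare TLD). Download the JAR file from http://code.google.com/p/guava-libraries/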

  • 2020-12-06 01:18

    After reading everything here, the correct solution should be (with Guava):

    InternetDomainName.from(uriHost).topPrivateDomain().toString();

    See also: errors when using Guava to get the private domain name
