Return root domain from url in R

Submitted by 寵の児 on 2020-01-14 15:01:12

Question


Given website addresses, e.g.

http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2

How do I return the root domain in R, e.g.

example.com
example2.co.uk

For my purposes I would define the root domain to have structure

example_name.public_suffix

where example_name excludes "www" and public_suffix is on the list here:

https://publicsuffix.org/list/effective_tld_names.dat

Is the following still the best regex-based solution?

https://stackoverflow.com/a/8498629/2109289

Is there something in R that derives the root domain from the public suffix list, along the lines of:

http://simonecarletti.com/code/publicsuffix/

Edited: Adding extra info based on Richard's comment

XML::parseURI seems to return the host name in its server element, i.e. the part between the first "//" and the next "/", with any port removed. For example:

> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"

Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:

Algorithm
  • Match domain against all rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one which is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
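
The steps above can be sketched in base R. This is only an illustration under stated assumptions: the rules vector is a tiny hand-picked subset standing in for the full public suffix list, and registrable_domain is a hypothetical helper name, not a function from any package.

```r
# Tiny stand-in for the public suffix list (assumption: the real list has
# thousands of entries and would be read from the .dat file instead).
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Hypothetical helper: return the registrable domain of a host name,
# or NA if the host is itself a public suffix.
registrable_domain <- function(host, rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)

  # A rule matches if its labels equal the rightmost labels of the host;
  # "*" matches any single label.
  rule_matches <- function(rule_labels) {
    k <- length(rule_labels)
    if (k > n) return(FALSE)
    tail_labels <- labels[(n - k + 1):n]
    all(rule_labels == "*" | rule_labels == tail_labels)
  }

  is_exception <- startsWith(rules, "!")
  rule_labels <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)
  matched <- vapply(rule_labels, rule_matches, logical(1))

  if (any(matched & is_exception)) {
    # An exception rule prevails; remove its leftmost label.
    prevailing <- rule_labels[[which(matched & is_exception)[1]]][-1]
  } else if (any(matched)) {
    # Otherwise the matching rule with the most labels prevails.
    candidates <- rule_labels[matched]
    prevailing <- candidates[[which.max(lengths(candidates))]]
  } else {
    prevailing <- "*"  # no rule matched: default rule
  }

  k <- length(prevailing)
  if (n <= k) return(NA_character_)  # nothing left beyond the public suffix
  # Public suffix (k labels) plus one additional label.
  paste(labels[(n - k):n], collapse = ".")
}

registrable_domain("www.example.com", rules)
registrable_domain("subdomain.example2.co.uk", rules)
```

Because "www" is simply another subdomain label, it drops out without special handling. A real implementation would also need to parse the comments and sections of the list file and handle internationalized names, which this sketch ignores.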

Answer 1:


There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:

library(httr)
host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"

The second is extracting the organizational domain (or root domain, top private domain--whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):

library(tldextract)  # assumed to be installed already
domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk

tldextract returns a data frame, with a row for each domain you give it, but you can easily paste together the relevant parts:

paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"



Answer 2:


Something like this should help:

> strsplit(gsub("http://|https://|www\\.", "", "http://www.example.com/page1/#"), "/")[[c(1, 1)]]
[1] "example.com"

> strsplit(gsub("http://|https://|www\\.", "", "https://subdomain.example2.co.uk/asdf?retrieve=2"), "/")[[c(1, 1)]]
[1] "subdomain.example2.co.uk"
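
Note that the second result still contains the subdomain, so this approach only isolates the host name; getting from the host to the root domain still requires the public suffix list. If only the host is needed, a slightly more defensive base R version might look like the sketch below. extract_host is a hypothetical helper name, and the sketch assumes ordinary host names (it does not handle IPv6 literals).

```r
# Hypothetical helper: extract the host name from a URL using base R only.
extract_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                     # drop path, query, fragment
  host <- sub("^.*@", "", host)                         # drop userinfo, if any
  host <- sub(":[0-9]+$", "", host)                     # drop the port, if any
  tolower(host)
}

extract_host("http://www.blog.omegahat.org:8080/RCurl/index.html")
extract_host("https://subdomain.example2.co.uk/asdf?retrieve=2")
```

Unlike the gsub call above, this leaves "www." in place, since removing it is part of finding the root domain rather than of parsing the URL.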


Source: https://stackoverflow.com/questions/26291025/return-root-domain-from-url-in-r
