I am trying to extract the domain names from a list of URLs, just like in
https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is th
With regex, you could use something like this:
(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))
https://regex101.com/r/WQXFy6/5
Note that you'll have to watch out for special cases such as co.uk.
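A minimal sketch of applying that pattern in Python (the helper name domain_from is mine; the group-1 capture is the piece you want):

    import re

    # The pattern from above: capture the label just before the suffix,
    # special-casing multi-part suffixes like co.uk and ac.us.
    DOMAIN_RE = re.compile(r'(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))')

    def domain_from(url):
        m = DOMAIN_RE.search(url)
        return m.group(1) if m else None

    print(domain_from('http://forums.news.cnn.com/'))  # cnn
    print(domain_from('https://www.example.co.uk'))    # example

Any multi-part suffix not listed in the alternation will still be mishandled, so extend the co.uk|ac.us list as needed.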
Simple solution via string splitting:

    def domain_name(url):
        return url.split("www.")[-1].split("//")[-1].split(".")[0]
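Restating the one-liner so the calls below run standalone — note it assumes the only subdomain is www, so deeper subdomains defeat it:

    def domain_name(url):
        return url.split("www.")[-1].split("//")[-1].split(".")[0]

    print(domain_name("http://www.github.com/user"))   # github
    print(domain_name("http://forums.news.cnn.com/"))  # forums, not cnn

If your list contains URLs with nested subdomains, prefer one of the parsing approaches below.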
Use tldextract. In contrast to urlparse, tldextract accurately separates the gTLD or ccTLD (generic or country-code top-level domain) from the registered domain and subdomains of a URL.
>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
You can use urlparse (https://docs.python.org/3/library/urllib.parse.html) to parse the URL and extract the netloc, and from the netloc you can easily pull out the domain name with split.
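A rough sketch of that idea with only the standard library (the name get_domain and the second-to-last-label heuristic are mine, and the heuristic misfires on multi-part suffixes like co.uk):

    from urllib.parse import urlparse

    def get_domain(url):
        # urlparse only fills netloc when the URL carries a scheme
        # ("http://..."); strip a port if one is present.
        netloc = urlparse(url).netloc.split(":")[0]
        parts = netloc.split(".")
        # Heuristic: the label just before the TLD is the domain.
        return parts[-2] if len(parts) >= 2 else netloc

    print(get_domain("http://forums.news.cnn.com/"))   # cnn
    print(get_domain("https://www.example.com:8080"))  # example

For schemeless inputs like "www.example.com", urlparse puts everything in path and netloc comes back empty, so normalize your URLs first.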