python regular expression for domain names

岁酱吖の 提交于 2019-12-10 20:39:21

问题


I am trying use the following regular expression to extract domain name from a text, but it just produce nothing, what's wrong with it? I don't know if this is suitable to ask this "fix code" question, maybe I should read more. I just want to save some time. Thanks

pat_url = re.compile(r'''

            (?:https?://)*

            (?:[\w]+[\-\w]+[.])*

            (?P<domain>[\w\-]*[\w.](com|net)([.](cn|jp|us))*[/]*)

            ''')

print re.findall(pat_url,"http://www.google.com/abcde")

I want the output to be google.com


回答1:


Don't use regex for this. Use the urlparse standard library instead. It's far more straightforward and easier to read/maintain.

http://docs.python.org/library/urlparse.html




回答2:


The first is that you're missing the re.VERBOSE flag in the call to re.compile(). The second is that you should use the methods on the returned object. The third is that you're using a regular expression where an appropriate parser already exists in the stdlib.




回答3:


This is the only correct way to parse an url with a regex:

It's in C++ but you'll find trivial to convert to python by removing additional \. And with an enum for the captures.

Also see RFC3986 as original source for the regexp.

static const char* const url_regex[] = {
    /* RE_URL */
    "^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\\?[^#]*)?(#.*)?",
};

enum {
    URL = 0,
    SCHEME_CLN = 1,
    SCHEME  = 2,
    DSLASH_AUTH = 3,
    AUTHORITY = 4,
    PATH    = 5,
    QUERY   = 6,
    FRAGMENT = 7
};



回答4:


I don't believe that this is actually about "regression", is it? It's about regular expressions, which is a totally different thing. Perhaps someone should fix the tagging.




回答5:


([a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])\.(COMMUNITY|DIRECTORY|EDUCATION|EQUIPMENT|INSTITUTE|MARKETING|SOLUTIONS|XN--J1AMH|XN--L1ACC|BARGAINS|BOUTIQUE|BUILDERS|CATERING|CLEANING|CLOTHING|COMPUTER|DEMOCRAT|DIAMONDS|GRAPHICS|HOLDINGS|LIGHTING|PARTNERS|PLUMBING|TRAINING|VENTURES|XN--P1AI|ACADEMY|CAREERS|COMPANY|CRUISES|DOMAINS|EXPOSED|FLIGHTS|FLORIST|GALLERY|GUITARS|HOLIDAY|KITCHEN|RECIPES|RENTALS|REVIEWS|SHIKSHA|SINGLES|SUPPORT|SYSTEMS|AGENCY|BERLIN|CAMERA|CENTER|COFFEE|CONDOS|DATING|ESTATE|EVENTS|EXPERT|FUTBOL|KAUFEN|LUXURY|MAISON|MONASH|MUSEUM|NAGOYA|PHOTOS|REPAIR|REPORT|SOCIAL|TATTOO|TIENDA|TRAVEL|VIAJES|VILLAS|VISION|VOTING|VOYAGE|BUILD|CARDS|CHEAP|CODES|DANCE|EMAIL|GLASS|HOUSE|NINJA|PARTS|PHOTO|SHOES|SOLAR|TODAY|TOKYO|TOOLS|WATCH|WORKS|AERO|ARPA|ASIA|BIKE|BLUE|BUZZ|CAMP|CLUB|COOL|COOP|FARM|GIFT|GURU|INFO|JOBS|KIWI|LAND|LIMO|LINK|MENU|MOBI|MODA|NAME|PICS|PINK|POST|QPON|RICH|RUHR|SEXY|TIPS|WANG|WIEN|ZONE|BIZ|CAB|CAT|CEO|COM|EDU|GOV|INT|KIM|MIL|NET|ONL|ORG|PRO|RED|TEL|UNO|WED|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN|MO|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|ZA|ZM|ZW)(?![-0-9a-z])(?!\.[a-z0-9])

This Regex uses all current valid TLDs found http://data.iana.org/TLD/tlds-alpha-by-domain.txt it will take a list of text and only return the domain.tld

Eg. Feed it

  • http://google.com
  • www.hotmail.com
  • yahoo.net

Will return

  • google.com
  • hotmail.com
  • yahoo.net

This isn't ideal, as the regex is quite long but it worked for what I needed at the time, hope it was helpful.



来源:https://stackoverflow.com/questions/2626995/python-regular-expression-for-domain-names

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!