Need a regex to validating a Url and support %20 and ()

痞子三分冷 提交于 2019-11-27 16:58:38

问题


I'm currently using the following regular expression to validation URLs:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?  (?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|edu|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

I borrowed this from somewhere on the web (don't remember where) to improve upon this:

^((https?|file|ftp|gopher|news|nntp):\/\/)([a-z]([a-z0-9\-]*\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel)|(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-z][a-z0-9_]*)?$

However, neither of these are capable of validating this url (which should be valid):

http://somedomain.com/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg

The problem is the %20 and round brackets (). Try as I might, I couldn't get either of the regex above to correctly validate the url above without breaking something else. I'm not experienced with writing fancy regular expressions, so that doesn't help either. All other web results I've found fail on silly things such as this:

http://www.test..com

Help would be appreciated.


回答1:


You're validating two things with the same regular expression:

  • Well formed -- Is it syntactically correct?
  • Plausible -- Are the protocol and top-level domain plausible?

Separating these validations may be fruitful. You can use this regular expression to check that the URI is well-formed. It's from RFC 3986, Uniform Resource Identifiers (URI): Generic Syntax, appendix B (p. 50):

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

If the URI matches this regular expression, it's well formed. The match groups give you the various pieces, which are:

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

Let's see what comes out of the sample URI you gave:

2 (scheme)   : "http"
4 (authority): "somedomain.com"
5 (path)     : "/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
7 (query)    : nil
9 (fragment) : nil

Now that you've got the individual pieces, you can check each one for plausibility. For example, to get the TLD from the authority, apply this regular expression to the authority:

\.([^.])$

Group 1 gives you the TLD (com, org, etc.), which you can then check against your list.



来源:https://stackoverflow.com/questions/2068405/need-a-regex-to-validating-a-url-and-support-20-and

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!