Why does \w match only English words in JavaScript regex?

借酒劲吻你 2020-12-09 20:26

I'm trying to find URLs in some text, using JavaScript code. The problem is, the regular expression I'm using uses \w to match letters and digits inside the URL, but it d

10 answers
  •  感情败类
    2020-12-09 20:52

    Note that URIs (a superset of URLs) are specified by RFC 3986, the IETF standard "URI: Generic Syntax", to allow only US-ASCII characters. All other characters should normally be represented in percent-encoded notation:

    In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. // URI: Generic Syntax

    This is generally what happens when you open a URL containing non-ASCII characters in a browser: they are translated into percent-encoded (%AB) notation, which is itself plain US-ASCII.

    If you can influence how the material is created, the best option is to run URLs through a urlencode()-style function when they are generated. In JavaScript the built-in equivalents are encodeURI() and encodeURIComponent().
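
    As a rough sketch of that idea: the sample URL and the pattern below are made up for illustration (the pattern is deliberately simplistic and is not a full URL grammar). It shows that once a URL has been percent-encoded, an ASCII-only class built on \w plus a literal % is enough to match all of it, whereas the raw form is cut off at the first non-ASCII letter.

```javascript
// Hypothetical sample URL with non-ASCII (Cyrillic) characters in the path.
const rawUrl = "http://example.com/путь";

// encodeURI() leaves URL delimiters such as ":" and "/" alone and
// percent-encodes other characters as UTF-8 octets.
const encodedUrl = encodeURI(rawUrl);

// Illustrative pattern only (an assumption, not the asker's regex).
// Without the "u"/"v" flags, \w means [A-Za-z0-9_], so "%" is listed
// explicitly to cover percent-escapes.
const urlPattern = /https?:\/\/[\w.\-\/%]+/;

// The raw URL is matched only up to the first non-ASCII letter.
const rawMatch = rawUrl.match(urlPattern)[0];

// The encoded URL is pure US-ASCII, so the whole string matches.
const encodedMatch = encodedUrl.match(urlPattern)[0];

console.log(rawMatch);
console.log(encodedMatch);

// decodeURI() restores the original characters for display.
console.log(decodeURI(encodedMatch));
```

    encodeURI() is intended for whole URLs; encodeURIComponent() also escapes delimiters such as "/" and "?", so it is the better fit for a single path segment or query value.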
