I\'m trying to find URLs in some text, using javascript code. The problem is, the regular expression I\'m using uses \\w to match letters and digits inside the URL, but it d
Because \w
only matches ASCII characters 48-57 ('0'-'9'), 67-90 ('A'-'Z') and 97-122 ('a'-'z'). Hebrew characters and other special foreign language characters (for example, umlaut-o or tilde-n) are outside of that range.
Instead of matching foreign language characters (there are so many of them, in many different ASCII ranges), you might be better off looking for the characters that delineate your words - spaces, quotation marks, and other punctuation.
If you're the one generating URLs with non-english letters in it, you may want to reconsider.
If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.
I've just found XRegExp which has not been mentioned yet and I'm quite impressed with it. It is an alternative regular expression implementation, has a unicode plugin and is licensed under MIT license.
According to the website, to match unicode chars, you'd use such code:
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
Perhaps \S (non-whitespace).
Note that URIs (as superset of URLs) are specified by W3C to only allow US-ASCII characters. Normally all other characters should be represented by percent-notation:
In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. // URI: Generic Syntax
Which is what generally happens when you open an URL with non-ASCII characters in browser, they get translated into %AB notation, which, in turn, is US-ASCII.
If it is possible to influence the way the material is created, the best option would be to subject URLs to urlencode() type function during their creation.
I think you are looking for this regex:
^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-z0-9\s\.\-_\\\/]+$