How can I remove all characters from a string that are not letters using a JavaScript RegEx?
RegEx instance properties used: g, i
global : Whether to test the regular expression against all possible matches in a string, or only against the first.
ignoreCase : Whether to ignore case while attempting a match in a string.
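To make the effect of the two flags concrete, here is a small sketch that runs the same pattern with each flag combination. The sample string and variable names are made up for illustration.

```javascript
// Sample input chosen for illustration only.
const input = "Ab1 cD2";

// No flags: only the first run of non-lowercase-letters ("A") is removed.
const noFlags = input.replace(/[^a-z]+/, "");     // "b1 cD2"

// g only: every run is removed, but uppercase letters count as non-letters.
const globalOnly = input.replace(/[^a-z]+/g, ""); // "bc"

// i only: uppercase letters are kept, but only the first run ("1 ") is removed.
const ignoreOnly = input.replace(/[^a-z]+/i, ""); // "AbcD2"

// g and i together: all non-letters are removed, in either case.
const both = input.replace(/[^a-z]+/gi, "");      // "AbcD"

console.log(noFlags, globalOnly, ignoreOnly, both);
```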
RegEx special characters used: [^a-z], +
[^xyz] : A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters by using a hyphen. For example, [^abc] is the same as [^a-c]. They initially match the 'r' in "brisket" and the 'h' in "chop".
+ : Matches the preceding item 1 or more times. Equivalent to {1,}.
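A quick sketch of what the + quantifier changes when combined with a negated set. With an empty replacement the result is the same either way, but the number of matches differs, which becomes visible as soon as the replacement is not empty. The sample string is hypothetical.

```javascript
// Sample input chosen for illustration only.
const sample = "a--b";

// Without +, each non-letter is a separate match.
const singleMatches = sample.match(/[^a-z]/g);    // ["-", "-"]

// With +, a whole run of non-letters is one match.
const runMatches = sample.match(/[^a-z]+/g);      // ["--"]

// The difference shows up when replacing with something non-empty.
const perChar = sample.replace(/[^a-z]/g, "_");   // "a__b"
const perRun = sample.replace(/[^a-z]+/g, "_");   // "a_b"

console.log(singleMatches, runMatches, perChar, perRun);
```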
JavaScript string replace method syntax
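str.replace(regexp|substr, newSubStr|function);
The flags g and i are built into the regex, either in the literal or through the RegExp constructor. Older Firefox versions also accepted them as a non-standard third argument to replace, but that extension was specific to Firefox and has since been removed, so it should not be relied on.
Examples:
var re = /[^a-z]+/gi; var str = "this is a string"; var newstr = str.replace(re, ""); console.log(newstr); // "thisisastring"
var str = "this is a string"; var newstr = str.replace(new RegExp("[^a-z]+", "gi"), ""); console.log(newstr); // "thisisastring"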
To keep whitespace characters as well, add \s to the negated set: [^a-z\s]+.
JavaScript Reference
You can use the replace method:
'Hey! The #123 sure is fun!'.replace(/[^A-Za-z]+/g, '');
>>> "HeyThesureisfun"
If you wanted to keep spaces:
'Hey! The #123 sure is fun!'.replace(/[^A-Za-z\s]+/g, '');
>>> "Hey The  sure is fun"
(Two spaces remain between "The" and "sure", because the spaces on both sides of "#123" are kept.)
The regex /[^a-z\s]/gi basically says to match anything that is not a letter a-z or a whitespace character (\s), doing this globally (the g flag) and ignoring the case of the string (the i flag).
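Because \s keeps every whitespace character, removing a token that has spaces on both sides leaves two adjacent spaces behind. If that matters, a second pass can collapse them. This is just a sketch; the function name keepLettersAndSpaces is made up for this example.

```javascript
// Hypothetical helper: strip everything except ASCII letters and whitespace,
// then collapse whitespace runs to a single space and trim the ends.
function keepLettersAndSpaces(text) {
  return text
    .replace(/[^a-z\s]+/gi, "") // drop non-letters, keep whitespace
    .replace(/\s+/g, " ")       // collapse runs of whitespace
    .trim();                    // remove leading/trailing whitespace
}

const cleaned = keepLettersAndSpaces("Hey! The #123 sure is fun!");
console.log(cleaned); // "Hey The sure is fun"
```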
Regular Expressions in ECMAScript implementations are IMHO best explained at the Mozilla Developer Network (formerly, Mozilla Developer Center) in the RegExp article of the JavaScript Language Reference pp.
However, as noted, the previous answers do not take non-English letters into account, such as umlauts and accented letters. In order not to remove those letters from the string, you have to add them to the negated character class like so:
var s = "Victor 1 jagt 2 zwölf 3 Boxkämpfer 4 quer 5 über 6 den 7 Sylter 8 Deich";
s = s.replace(/[^a-zäöüß]+/gi, "");
This approach quickly becomes tedious and hard to maintain, especially if several natural languages need to be considered (and even in proper English there are foreign words like "déjà vu" and "fiancé").
Therefore, among other PCRE features, JSX:regexp.js lets you use Regular Expressions that can use Unicode property classes, through the Unicode Character Database (UCD).
You would then write¹
var s = "Victor 1 jagt 2 zwölf 3 Boxkämpfer 4 quer 5 über 6 den 7 Sylter 8 Deich";
var rxNotLetter = new jsx.regexp.RegExp("\\P{Ll}+", "gi");
s = s.replace(rxNotLetter, "");
or
var s = "El 1 veloz 2 murciélago 3 hindú 4 comía 5 feliz 6 cardillo 7 y 8 kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja"
+ " – Съешь 1 же 2 ещё 3 этих 4 мягких 5 французских 6 булок, да 7 выпей 8 чаю.";
var rxNotLetterOrWhitespace = new jsx.regexp.RegExp("[^\\p{Ll}\\p{Lu}\\s]+", "g");
s = s.replace(rxNotLetterOrWhitespace, "");
to reduce dependency on the uppercase/lowercase quirks of implementations (and be more extensible), for a RegExp that matches all non-letter Unicode characters (except white-space in the second example), so that replace removes them.
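As a possible alternative without a library: engines that implement ES2018 Unicode property escapes accept \p{L} (any letter) directly when the u flag is set. That is an assumption about your runtime, since older browsers do not support it. A minimal sketch, with variable names chosen for illustration and shortened sample strings; normalize("NFC") is only a precaution so that accented letters are single code points rather than a base letter plus a combining mark.

```javascript
// Assumes a runtime with ES2018 Unicode property escapes (requires the u flag).
const notLetter = /\P{L}+/gu;             // runs of anything that is not a letter
const notLetterOrSpace = /[^\p{L}\s]+/gu; // same, but whitespace is kept

const german = "Victor 1 jagt 2 zwölf 3 Boxkämpfer 4 quer 5 über 6 den 7 Sylter 8 Deich";
const lettersOnly = german.normalize("NFC").replace(notLetter, "");

const russian = "Съешь 1 же 2 ещё 3 этих";
const lettersAndSpaces = russian.normalize("NFC").replace(notLetterOrSpace, "");

console.log(lettersOnly);
console.log(lettersAndSpaces);
```

Note that the second result still contains two spaces wherever a digit was removed.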
Testcase
Be sure to provide a version of the Unicode Character Database as well, because it is large, in flux, and therefore not built into regexp.js (JSX contains a verbose text and compacted script version of the UCD; both can be used, and the latter is preferred, by regexp.js). Note that a conforming ECMAScript implementation does not need to support characters beyond the Basic Multilingual Plane (U+0000 to U+FFFF), so jsx.regexp.RegExp currently cannot support those even though they are in the UCD. See the documentation in the source code for details.
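To illustrate why characters outside the Basic Multilingual Plane are awkward: a JavaScript string stores them as two UTF-16 code units, and a regex without the u flag sees those two units separately. The sketch below uses native syntax and assumes an engine with the ES2015 u flag and ES2018 property escapes; it says nothing about how jsx.regexp.RegExp behaves internally.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A, a letter outside the BMP.
const boldA = "\u{1D400}";

const codeUnits = boldA.length;              // 2 (a surrogate pair)
const dotWithoutU = /^.$/.test(boldA);       // false: "." matches one code unit
const dotWithU = /^.$/u.test(boldA);         // true: "." matches one code point

// With the u flag, the letter survives a non-letter filter intact.
const filtered = ("1" + boldA + "2").replace(/\P{L}+/gu, "");

console.log(codeUnits, dotWithoutU, dotWithU, filtered === boldA);
```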
¹ Pangrams from Wikipedia, the free encyclopedia.