Regex to remove non-letter characters but keep accented letters

Submitted by 扶醉桌前 on 2019-12-18 14:54:27

Question


I have strings in Spanish and other languages that may contain generic special characters like (), *, etc., which I need to remove. The problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and those need to remain. So I am trying to do it with a regexp in the following way:

var desired = stringToReplace.replace(/[^\w\s]/gi, '');

Unfortunately, it removes all special characters, including the language-related ones. I'm not sure how to avoid that. Could someone suggest a solution?
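The cause is that \w in JavaScript covers only [A-Za-z0-9_], so every accented letter falls into [^\w\s]. A small sketch that reproduces the problem (the sample string and variable names are made up for illustration):

```javascript
// \w is ASCII-only in JavaScript: [A-Za-z0-9_].
// Accented letters therefore match [^\w\s] and are stripped too.
const sample = "¿Dónde está (el) niño?"; // made-up sample input
const stripped = sample.replace(/[^\w\s]/gi, "");
console.log(stripped); // "Dnde est el nio"
```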


Answer 1:


I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips every character that is neither whitespace nor a Latin-script character from a string: http://jsfiddle.net/b3awZ/1/

var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, ""); // "Me puedes decir la contraseña de la WiFi"

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words




Answer 2:


Note: this works only for 16-bit code points (the Basic Multilingual Plane), so this answer is incomplete.

Short answer

The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".

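To match the characters that are neither digits nor letters (in order to remove them), write a ^ as the first character after the opening bracket [, then prepend / and append +/g. The g flag is needed so that every occurrence is replaced, not just the first.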
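As an illustration of both forms, here is a sketch that uses only the first few ranges of the class above (an abbreviation chosen for readability, not the full class). The variable names are made up, and \s is added to the negated form on the assumption that whitespace should be kept:

```javascript
// Abbreviated class: only the first few ranges of the full class above.
const latinOrDigit = "0-9A-Za-z\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u02af\\u1e00-\\u1eff";

// Anchored form: the whole string must consist of latin letters and digits.
const onlyLatinOrDigit = new RegExp("^[" + latinOrDigit + "]+$");

// Negated form, with \s added so that whitespace is kept.
const notLatinOrDigit = new RegExp("[^" + latinOrDigit + "\\s]+", "g");

console.log(onlyLatinOrDigit.test("mérito"));     // true
console.log(onlyLatinOrDigit.test("mérito (1)")); // false
const cleaned = "¿Schönheit, mérito (100%)?".replace(notLatinOrDigit, "");
console.log(cleaned); // "Schönheit mérito 100"
```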

How did I find that out? Continue reading.

Long answer: use metaprogramming!

Because JavaScript did not have Unicode-aware regexes at the time of writing, I wrote a Python program to iterate over the 16-bit range of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?

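import unicodedata
import re
import sys

def unicodeNameMatch(pattern, codepoint):
  # re.match anchors at the start of the name, so "latin" matches
  # "LATIN SMALL LETTER A". Unnamed code points raise ValueError.
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    return None

def regexChr(codepoint):
  # Printable ASCII is emitted literally, everything else as \uXXXX.
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # skip the script name itself
prev = None

js_regex = ""
for codepoint in range(pow(2, 16)):
  if any([unicodeNameMatch(name, codepoint) for name in names]):
    # Start of a run: emit its first character.
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    # End of a run: close the range with the last matching character.
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None

print("[" + js_regex + "]")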

Invoke it like this: python char_class.py latin digit and you get the character class mentioned above (the exact ranges depend on the Unicode version bundled with your Python). It's an ugly char class, but you know for sure that you caught all characters whose names start with latin or digit.

Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it's the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all latin small letters.

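Edit: There is a small misfeature (aka bug): a run that consists of a single character is still written as a range, which is why \u271d-\u271d occurs in the regex. Comparing prev with codepoint does not help, because at the end of a run prev is always codepoint - 1. One way to fix it is to remember where the run started. Extend the line that starts a run to

if prev is None: js_regex += regexChr(codepoint); start = codepoint

and replace

if not prev is None: js_regex += "-" + regexChr(prev)

by

if not prev is None and prev != start: js_regex += "-" + regexChr(prev)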
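The range-compression step on its own can be sketched in JavaScript as follows. This is not the author's script: toCharClass is a hypothetical helper that takes an already sorted list of code points, and, like the script, it does not escape ASCII characters that are special inside a character class.

```javascript
// Format one code point the way the script does: printable ASCII as-is,
// everything else as \uXXXX.
function regexChr(cp) {
  return cp >= 32 && cp < 127
    ? String.fromCharCode(cp)
    : "\\u" + cp.toString(16).padStart(4, "0");
}

// Hypothetical helper: turn a sorted list of code points into a character
// class, writing single-character runs without a range.
function toCharClass(codepoints) {
  let body = "";
  let start = null;
  let prev = null;
  const flush = () => {
    if (start === null) return; // no run open yet
    body += regexChr(start);
    if (prev !== start) body += "-" + regexChr(prev);
  };
  for (const cp of codepoints) {
    if (prev !== null && cp === prev + 1) {
      prev = cp; // still inside the current run
      continue;
    }
    flush(); // close the previous run, if any
    start = prev = cp;
  }
  flush(); // close the final run
  return "[" + body + "]";
}

const charClass = toCharClass([0x30, 0x31, 0x32, 0x271d, 0xfb00, 0xfb01]);
console.log(charClass); // [0-2\u271d\ufb00-\ufb01]
```

Tracking the start of each run is what lets a single character such as \u271d be written once instead of as \u271d-\u271d.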



Answer 3:


Instead of whitelisting characters you accept, you could try blacklisting illegal characters:

var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
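A quick sketch of how this blacklist behaves (the sample string and variable names are made up). Accented letters survive because they are not listed, but so does any punctuation that is not listed, such as the inverted question mark:

```javascript
// Same blacklist as above: only the listed ASCII symbols are removed.
const blacklist = /[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi;
const text = "¿Hola, señor (José)!"; // made-up sample input
const result = text.replace(blacklist, "");
console.log(result); // "¿Hola señor José"
```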



Answer 4:


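var desired = stringToReplace.replace(/(?=[\u0000-\u007F])[^\w\s]/g, '');

might do the trick: it removes only ASCII characters that are neither word characters nor whitespace. The lookahead is there because both conditions have to apply to the same character; two character classes written side by side would match two consecutive characters instead.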
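A quick check of the ASCII-only idea, assuming both conditions are meant to apply to the same character (written here with a lookahead); the sample string and variable names are made up:

```javascript
// Remove a character only if it is ASCII AND neither a word character
// nor whitespace. Everything outside ASCII is left untouched.
const asciiNonWord = /(?=[\u0000-\u007F])[^\w\s]/g;
const raw = "contraseña (Wi-Fi) ¿no?"; // made-up sample input
const kept = raw.replace(asciiNonWord, "");
console.log(kept); // "contraseña WiFi ¿no"
```

Non-ASCII punctuation such as ¿ is kept as well, which may or may not be what you want.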

See also this Javascript + Unicode regexes question.




Answer 5:


If you insist on whitelisting, here is the rawest way of doing it:

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

It works by keeping track of 'all' Unicode letter characters.




Answer 6:


Unfortunately, JavaScript did not support Unicode character properties when this was written (which would be just the right regex feature for you); ES2018 later added them as \p{...} under the u flag. If changing the language is an option for you, PHP (for example) can do this:

preg_replace("/[^\pL0-9_\s]/u", "", $str);

Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified), and the u modifier makes PCRE treat the pattern and the subject as UTF-8.
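For comparison, the same character class can be written directly in JavaScript on engines that implement ES2018 Unicode property escapes. This is a sketch under that assumption, not part of the original answer, and the variable names are made up:

```javascript
// Requires ES2018 Unicode property escapes, which need the u flag.
// \p{L} matches any Unicode letter, mirroring \pL in the PHP pattern.
const notLetterDigitSpace = /[^\p{L}0-9_\s]/gu;
const phrase = "¿Me puedes decir la contraseña de la Wi-Fi?";
const output = phrase.replace(notLetterDigitSpace, "");
console.log(output); // "Me puedes decir la contraseña de la WiFi"
```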

If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the remaining options are probably blacklisting or whitelisting. Your bounty mentions that blacklisting is not actually an option in your case, so you will probably have to include the special characters of the relevant languages manually. You could do this:

var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');

Or use their corresponding Unicode sequences:

var desired = stringToReplace.replace(/[^\w\s\u00F1\u00E1\u00F3\u00ED]/gi, '');

Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
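If the list of letters grows, the regex can be built from a string instead of being edited by hand. A minimal sketch, where makeStripper is a hypothetical helper and the letter list is only an example for Spanish:

```javascript
// Hypothetical helper: builds the "strip everything else" regex from a
// string of extra letters to keep.
function makeStripper(extraLetters) {
  // Escape characters that are special inside a character class.
  const escaped = extraLetters.replace(/[\\\]\[^-]/g, "\\$&");
  return new RegExp("[^\\w\\s" + escaped + "]", "gi");
}

// Lowercase letters are enough: the i flag also covers Ñ, Ü, and so on.
const stripper = makeStripper("ñáéíóúü");
const desiredText = "¡Ñandú, PINGÜINO y café!".replace(stripper, "");
console.log(desiredText); // "Ñandú PINGÜINO y café"
```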



Source: https://stackoverflow.com/questions/8340719/regex-to-remove-non-letter-characters-but-keep-accented-letters
