What's a good regex to include accented characters in a simple way?

死守一世寂寞

Right now my regex is something like this:

[a-zA-Z0-9], but it does not match accented characters, which I want it to. I would also like - ' , to be included.

4 Answers

孤街浪徒

2020-12-14 02:54
Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:
```
(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$
```
Please see the demo (you can add characters to test).

Explanation
- (?i) sets case-insensitive mode
- The ^ anchor asserts that we are at the beginning of the string
- (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character:
  - The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
  - [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which the lookahead subtracts the unwanted characters
- The + matches that one or more times
- The $ anchor asserts that we are at the end of the string
Reference

Extended ASCII Table
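
To try this outside the demo, here is a minimal sketch in Python, assuming the built-in re module (which accepts the leading (?i) flag and the negative lookahead). The names NAME_RE and is_valid are made up for illustration.

```python
import re

# The pattern from this answer, unchanged. (?i) lets a-z cover A-Z as well.
NAME_RE = re.compile(r"(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$")


def is_valid(text: str) -> bool:
    """Return True if every character is in the allowed range and not excluded."""
    return NAME_RE.match(text) is not None


for sample in ["José", "Müller-O'Neil", "2×3", "Straße", "hello world"]:
    print(sample, is_valid(sample))
```

The first two samples pass. "2×3" and "Straße" fail because × and ß are in the exclusion list, and "hello world" fails because a space is not in the allowed class.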
自闭症患者

2020-12-14 02:57
You just put:
```
\p{L}\p{M}
```
in your expression (inside the character class). In a Unicode-aware engine this will match:
- any letter character (L) from any language
- any mark (M), i.e. a character that is meant to be combined with another one: an accent, etc.
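
Python's built-in re module does not accept \p{...} (the third-party regex package does), so the sketch below swaps in a different technique with the same effect: it checks each character's Unicode general category through the standard unicodedata module. EXTRA, is_letter_or_mark and is_valid are hypothetical names, and the extra characters (dash, apostrophe, ASCII digits) are an assumption taken from the question.

```python
import unicodedata

# Assumption: the non-letter characters the question asks for.
EXTRA = set("-'0123456789")


def is_letter_or_mark(ch: str) -> bool:
    # General categories starting with "L" are letters, "M" are combining marks.
    return unicodedata.category(ch)[0] in ("L", "M")


def is_valid(text: str) -> bool:
    """Rough equivalent of ^[-'0-9\\p{L}\\p{M}]+$ without regex support for \\p."""
    return bool(text) and all(ch in EXTRA or is_letter_or_mark(ch) for ch in text)


print(is_valid("José"))        # precomposed é is a letter
print(is_valid("Jose\u0301"))  # e followed by a combining acute accent (a mark)
print(is_valid("2×3"))         # × is a math symbol, neither letter nor mark
```

Allowing marks as well as letters is what lets the decomposed spelling of "José" pass without normalizing the input first.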
渐次进展

2020-12-14 03:03
A version without the exclusion rules:
```
^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$
```
Explanation
- The ^ anchor asserts that we are at the beginning of the string
- [...] allows dash, apostrophe, letters, and the accented ranges À-Ö, Ø-ö and ø-ÿ, which skip × and ÷. Digits are not included; add 0-9 inside the brackets if you need them.
- The + matches that one or more times
- The $ anchor asserts that we are at the end of the string
Reference
- Extended ASCII Table
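
To see what splitting the range into three pieces buys you, this small Python sketch (built-in re module, variable names made up for illustration) walks every code point from À (U+00C0) to ÿ (U+00FF) and lists the ones the class rejects.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r"^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$")

# Test each single character in the À..ÿ block against the pattern.
rejected = [chr(cp) for cp in range(0x00C0, 0x0100) if not pattern.match(chr(cp))]
print(rejected)  # ['×', '÷']
```

Only the multiplication and division signs fall in the gaps between the three ranges, so no lookahead is needed to exclude them.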
醉话见心

2020-12-14 03:03

Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):

[-'[:alpha:]0-9] or [-'[:alnum:]]

The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale.
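
Not every engine supports POSIX bracket expressions. Python's built-in re module, for instance, does not, so the following is only an approximation of [-'[:alnum:]] under that assumption: it uses [^\W_], which for Python 3 str patterns matches Unicode letters and digits regardless of locale. ALNUM_RE and is_valid are hypothetical names.

```python
import re

# [^\W_] reads as "a word character that is not an underscore",
# which leaves letters and digits.
ALNUM_RE = re.compile(r"^(?:[-']|[^\W_])+$")


def is_valid(text: str) -> bool:
    """Approximate [-'[:alnum:]]+ for engines without POSIX classes."""
    return ALNUM_RE.match(text) is not None


for sample in ["José", "O'Neil-Smith", "abc123", "snake_case", "a b"]:
    print(sample, is_valid(sample))
```

The main difference from the POSIX class is that the result here depends on Unicode character properties rather than on the current locale.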
