JavaScript - regex - word boundary (\b) issue

悲哀的现实

I'm having difficulty using \b with Greek characters in a regex.

In this example, [a-zA-ZΆΈ-ώἀ-ῼ]* succeeds in marking all the words I want (both

3 Answers

抹茶落季

2020-12-01 16:40
Try something like this:
```
\s[a-zA-ZΆΈ-ώἀ-ῼ]{2}\s
```
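A quick sketch of how this pattern behaves, using a sample string of my own (not from the question). Because it requires whitespace on both sides, it skips two-letter words at the start or end of the string, and it also skips a word whose leading space was consumed by the previous match:

```javascript
// Sample text is an assumption for illustration only.
const sampleText = "το φως σε με";

// Two letters from the question's character class, surrounded by whitespace.
const spacedTwoLetters = /\s[a-zA-ZΆΈ-ώἀ-ῼ]{2}\s/g;

const found = sampleText.match(spacedTwoLetters);
console.log(found); // [" σε "]  ("το" and "με" touch the string edges and are missed)
```

So this is fine as a starting point, but it needs anchors or a lookahead if words at the edges of the text matter.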
鱼传尺愫

2020-12-01 16:53
You can use \S

Rather than write a match for "word characters plus these characters" it may be appropriate to use a regex that matches not-whitespace:
```
\S
```
It's broader in scope, but simpler to write/use.

If that's too broad - use an exclusive list rather than an inclusive list:
```
[^\s\.]
```
That is - any character that is not whitespace and not a dot. In this way it's also easy to add to the exceptions.
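As a rough illustration of the exclusive-list approach, here is a sketch that splits a mixed Greek/Latin string into tokens. The sample sentence and variable names are my own assumptions:

```javascript
// Sample text is an assumption for illustration only.
const mixedText = "Καλημέρα κόσμε. Hello";

// One or more characters that are neither whitespace nor a dot.
const tokens = mixedText.match(/[^\s\.]+/g);

console.log(tokens.length); // 3
console.log(tokens[2]);     // "Hello"
```

To exclude more punctuation, add it inside the brackets, e.g. [^\s\.,;].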

Don't try to use \b

Word boundaries don't work with non-ASCII characters, which is easy to demonstrate:
```
> "yay".match(/\b.*\b/)
["yay"]
> "γaγ".match(/\b.*\b/)
["a"]
```
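Therefore it's not possible to use \b to detect words with Greek characters: \b is defined in terms of \w, which covers only [A-Za-z0-9_], so Greek letters count as non-word characters and a boundary appears only next to an ASCII word character.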
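The underlying reason can be checked directly. This small sketch (variable names are mine) shows that \w rejects a Greek letter, and that \b only fires where an ASCII word character is involved:

```javascript
// Without flags, \w is equivalent to [A-Za-z0-9_].
const asciiIsWord = /\w/.test("a"); // true
const greekIsWord = /\w/.test("γ"); // false

// The first boundary in "γaγ" is between "γ" and "a", at index 1.
const firstBoundary = "γaγ".search(/\b/); // 1

// A purely Greek string contains no \b position at all.
const greekHasBoundary = /\b/.test("γγγ"); // false

console.log(asciiIsWord, greekIsWord, firstBoundary, greekHasBoundary);
```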

Match 2 character words

The following pattern can be used to match two character words:
```
pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;
```
(More accurately: to match sequences of exactly two non-whitespace characters.)

That is:
```
(^|[\s\.,]) - start of string or whitespace/punctuation (back reference 1)
(\S{2})     - two not-whitespace characters (back reference 2)
(?=$|[\s\.,]) - end of string or whitespace/punctuation (positive lookahead)
```
That pattern can be used like so to remove matching words:
```
// "$1" puts back the leading delimiter captured by group 1
"input string".replace(pattern, "$1");
```
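For a concrete run, here is a sketch on a sample string of my own, passing "$1" as the replacement so the delimiter in front of each removed word is kept. The final whitespace cleanup is an extra step I've assumed you'd want; it isn't part of the pattern itself:

```javascript
const pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;

// Sample text is an assumption for illustration only.
const input = "το φως σε με";

// Drop each two-character word but keep the delimiter before it (group 1).
const stripped = input.replace(pattern, "$1");

// Optional: collapse the leftover whitespace.
const tidy = stripped.replace(/\s+/g, " ").trim();

console.log(JSON.stringify(stripped)); // " φως  "
console.log(tidy);                     // "φως"
```

Because the trailing delimiter is only a lookahead, consecutive two-character words ("σε με") are both matched.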
Here's a jsfiddle demonstrating the pattern's use on the texts in the question.
暗喜

2020-12-01 16:55
Since JavaScript historically lacked lookbehind (it was only added in ES2018 and is missing from older engines), and since word boundaries work only with members of the \w character class, the portable way is to use groups (and capturing groups if you want to make a replacement):
```
(?m)(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])
```
Example that removes two-letter words (JavaScript has no inline (?m) modifier, so the m flag is used instead):
```
txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');
```
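To avoid repeating the character class three times, the same idea can be wrapped in a helper. This is only a sketch: the names LETTERS and removeTwoLetterWords are hypothetical, and the replacement uses "$1", which is how JavaScript refers to the first capture group in a replacement string:

```javascript
// Character ranges taken from the question; the constant name is hypothetical.
const LETTERS = "a-zA-ZΆΈ-ώἀ-ῼ";

// (start of line or a non-letter) + exactly two letters + (not followed by a letter)
const twoLetterWord = new RegExp(
  `(^|[^${LETTERS}\\n])([${LETTERS}]{2})(?![${LETTERS}])`,
  "gm"
);

// Hypothetical helper: removes two-letter words, keeping the preceding delimiter.
function removeTwoLetterWords(txt) {
  return txt.replace(twoLetterWord, "$1");
}

console.log(JSON.stringify(removeTwoLetterWords("το φως σε με"))); // " φως  "
console.log(JSON.stringify(removeTwoLetterWords("ab xyz")));       // " xyz"
```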