JavaScript - regex - word boundary (\b) issue

悲哀的现实 2020-12-01 16:00

I'm having difficulty using \b with Greek characters in a regex.

In this example, [a-zA-ZΆΈ-ώἀ-ῼ]* succeeds in marking all the words I want (both

3 Answers
  •  鱼传尺愫
    2020-12-01 16:53

    You can use \S

    Rather than write a match for "word characters plus these characters" it may be appropriate to use a regex that matches not-whitespace:

    \S
    

    It's broader in scope, but simpler to write/use.

    If that's too broad - use an exclusive list rather than an inclusive list:

    [^\s\.]
    

    That is - any character that is not whitespace and not a dot. In this way it's also easy to add to the exceptions.
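
    As a rough illustration of the two character classes above, the sketch below tokenizes a made-up sample string (the text and variable names are assumptions, not taken from the question):

    ```javascript
    // Hypothetical sample text mixing Latin and Greek words.
    const text = "ab γδ. word λέξη";

    // \S+ keeps trailing punctuation attached to the word.
    const rawTokens = text.match(/\S+/g);
    console.log(rawTokens); // ["ab", "γδ.", "word", "λέξη"]

    // [^\s.]+ also excludes the dot, so the punctuation is dropped.
    const tokens = text.match(/[^\s.]+/g);
    console.log(tokens); // ["ab", "γδ", "word", "λέξη"]
    ```

    Adding more exceptions is a matter of extending the negated class, for example [^\s.,;].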

    Don't try to use \b

    Word boundaries don't work with non-ASCII characters, which is easy to demonstrate:

    > "yay".match(/\b.*\b/)
    ["yay"]
    > "γaγ".match(/\b.*\b/)
    ["a"]
    

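    Therefore it's not possible to use \b to detect words with Greek characters: \b is defined in terms of \w, which matches only [A-Za-z0-9_], so every Greek letter is treated as a non-word character and boundaries appear only next to ASCII word characters.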
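
    A short sketch of why this happens, followed by one possible Unicode-aware alternative. The alternative is not part of the original answer: it assumes an engine with ES2018 support (the u flag, \p{L} property escapes, and lookbehind), and the sample string is made up.

    ```javascript
    // \w is ASCII-only, so Greek letters count as non-word characters.
    console.log(/\w/.test("γ"));  // false
    // With no \w characters at all, the string contains no word boundary.
    console.log(/\b/.test("γδ")); // false

    // Assumed alternative: a run of Unicode letters that is neither preceded
    // nor followed by another letter, emulating \b for any script.
    const unicodeWord = /(?<!\p{L})\p{L}+(?!\p{L})/gu;
    const words = "γaγ και yay".match(unicodeWord);
    console.log(words); // ["γaγ", "και", "yay"]
    ```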

    Match two-character words

    The following pattern can be used to match two-character words:

    pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;
    

    (More accurately: it matches sequences of exactly two non-whitespace characters.)

    That is:

    (^|[\s\.,])   - start of string or whitespace/punctuation (capture group 1)
    (\S{2})       - two non-whitespace characters (capture group 2)
    (?=$|[\s\.,]) - end of string or whitespace/punctuation (positive lookahead, not consumed)
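
    To see the groups in action, here is a minimal sketch that collects group 2 from every match. The sample string is an assumption, and String.prototype.matchAll requires an ES2020 engine.

    ```javascript
    const pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;

    // Hypothetical input mixing Latin and Greek words of various lengths.
    const sample = "ab word γδ, xyz ok";

    // m[1] is the delimiter before the word, m[2] is the two-character word.
    const twoCharWords = Array.from(sample.matchAll(pattern), (m) => m[2]);
    console.log(twoCharWords); // ["ab", "γδ", "ok"]
    ```

    Because the trailing delimiter is only looked at and not consumed, it remains available as the leading delimiter of the next match, so adjacent two-character words are all found.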
    

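    That pattern can be used like so to remove matching words. Pass "$1" as the replacement so the delimiter captured before each word is kept (with no replacement argument, each match would be replaced by the string "undefined"):

    "input string".replace(pattern, "$1");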
    

    Here's a jsfiddle demonstrating the pattern's use on the texts in the question.
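
    Since the fiddle is not reproduced here, the removal step can be sketched as follows, using a made-up input string rather than the texts from the question:

    ```javascript
    const pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;

    // Hypothetical input; "ab", "γδ" and "ok" are the two-character words.
    const input = "ab word γδ, xyz ok";

    // "$1" puts back the delimiter captured before each word, so only the
    // two-character word itself is removed.
    const cleaned = input.replace(pattern, "$1");
    console.log(JSON.stringify(cleaned)); // " word , xyz "

    // Omitting the replacement argument inserts the text "undefined" instead.
    console.log("ab".replace(pattern)); // undefined
    ```

    The delimiters around removed words are left in place, so a follow-up pass (for example collapsing repeated spaces and trimming) may be wanted depending on the use case.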
