UTF-8 word boundary regex in JavaScript

狂风中的少年 submitted on 2019-11-26 02:59:02

Question


In JavaScript:

"ab abc cab ab ab".replace(/\bab\b/g, "AB");

correctly gives me:

"AB abc cab AB AB"

When I use utf-8 characters though:

"αβ αβγ γαβ αβ αβ".replace(/\bαβ\b/g, "AB");

the word boundary operator doesn't seem to work:

"αβ αβγ γαβ αβ αβ"

Is there a solution to this?


Answer 1:


The word boundary assertion \b matches only at a position where a word character is not preceded or not followed by another word character (so .\b. is equivalent to \W\w or \w\W). And \w is defined as [A-Za-z0-9_], so \w does not match Greek characters. Thus you cannot use \b in this case.
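A quick way to see this in a console, as a minimal sketch (the sample strings and the boundaryCount helper are only illustrative):

```javascript
// \w covers only ASCII letters, digits, and underscore.
const asciiIsWord = /\w/.test("a");
const greekIsWord = /\w/.test("α");

// Count the positions where \b matches in a string.
const boundaryCount = (s) => (s.match(/\b/g) || []).length;

const asciiBoundaries = boundaryCount("ab abc"); // before/after "ab" and "abc"
const greekBoundaries = boundaryCount("αβ αβγ"); // no \w characters, so no boundaries

console.log(asciiIsWord, greekIsWord, asciiBoundaries, greekBoundaries);
// true false 4 0
```

Adding the u flag on its own does not widen \w, so /\bαβ\b/u fails in the same way.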

What you could do instead is to use this:

"αβ αβγ γαβ αβ αβ".replace(/(^|\s)αβ(?=\s|$)/g, "$1AB")
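If the target engine supports ES2018 lookbehind and Unicode property escapes (an assumption; older browsers do not), a Unicode-aware boundary can probably be emulated with lookarounds instead of whitespace checks. The following is a sketch, and its definition of "word character" (letter, mark, number, or underscore) is a choice made for this example, not a standard:

```javascript
// Assumes an ES2018+ engine: lookbehind (?<!...) and \p{...} with the u flag.
// "Word character" here means any letter, mark, number, or underscore.
const unicodeWord = /(?<![\p{L}\p{M}\p{N}_])αβ(?![\p{L}\p{M}\p{N}_])/gu;

const result = "αβ αβγ γαβ αβ αβ".replace(unicodeWord, "AB");
console.log(result); // "AB αβγ γαβ AB AB"

// Unlike the whitespace-based pattern, punctuation also counts as a boundary.
const withPunctuation = "(αβ), αβ.".replace(unicodeWord, "AB");
console.log(withPunctuation); // "(AB), AB."
```

Because both sides are zero-width assertions, nothing around the word is consumed, so no $1 is needed in the replacement.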



Answer 2:


Not all JavaScript regexp implementations support Unicode, so you need to escape the characters:

"αβ αβγ γαβ αβ αβ".replace(/\u03b1\u03b2/g, "AB"); // "AB ABγ γAB AB AB"

To look up the code points for the characters, you can take a look at http://htmlhelp.com/reference/html40/entities/symbols.html

Of course, this doesn't help with the word boundary issue (as explained in the other answers), but it should at least enable you to match the characters properly.




Answer 3:


I needed something to be programmable and handle punctuation, brackets, etc.

http://jsfiddle.net/AQvyd/

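var wordToReplace = '買い手',
    replacementWord = '[[BUYER]]',
    text = 'Mange 買い手 information. The selected Store and Classification will be the default on the สั่งซื้อ.';

function replaceWord(text, wordToReplace, replacementWord) {
    // Escape regex metacharacters so the word is matched literally.
    var escaped = wordToReplace.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Group 1 keeps the leading delimiter; the lookahead leaves the trailing one in place.
    var re = new RegExp('(^|[\\s(\'",;])' + escaped + '(?=$|[\\s).\'"!,;?])', 'gi');
    return text.replace(re, function (match, lead) { return lead + replacementWord; });
}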

I've written a JavaScript resource editor, which is how I found this page. I answered out of necessity, since I couldn't find a parameterized word-boundary regexp that worked well for Unicode.
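For engines with ES2018 support, a parameterized variant could avoid listing delimiters by hand. This is only a sketch: replaceWholeWord is a hypothetical name, and treating letters, marks, numbers, and underscore as word characters is an assumption of the example.

```javascript
// Hypothetical helper; assumes an ES2018+ engine (lookbehind, \p{...}, u flag).
function replaceWholeWord(text, word, replacement) {
    // Escape regex metacharacters so the word is matched literally.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // Anything that is not a letter, mark, number, or underscore acts as a boundary.
    const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
    const re = new RegExp("(?<!" + wordChar + ")" + escaped + "(?!" + wordChar + ")", "gu");
    // A function replacement keeps "$" in the replacement from being interpreted.
    return text.replace(re, () => replacement);
}

const sample = "Manage 買い手 information (買い手).";
const replaced = replaceWholeWord(sample, "買い手", "[[BUYER]]");
console.log(replaced); // "Manage [[BUYER]] information ([[BUYER]])."
```

Note that this only finds words set off by spaces or punctuation, as in the sample text. Running text in languages written without spaces between words (Japanese, Thai) has no such delimiters, so a word embedded in a longer run of letters is left alone.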




Answer 4:


Not all of the RegEx implementations in JavaScript engines are Unicode-aware.

For example, Microsoft's JScript, used in IE, is limited to ANSI.




Answer 5:


When you’re dealing with Unicode and natural-language words, you probably want to be more careful with boundaries than just using \b. See this answer for details and directions.



Source: https://stackoverflow.com/questions/2881445/utf-8-word-boundary-regex-in-javascript
