Regular expression to match and split on Chinese comma in JavaScript

雨燕双飞 submitted on 2020-06-25 04:57:28

Question


I have a regular expression /\s*,\s*/ that matches a comma together with any whitespace before and after it.

Example:

var str = "john,walker    james  , paul";
var arr = str.split(/\s*,\s*/);
Values in arr = ["john", "walker    james", "paul"] // Size: 3

Example with Chinese characters:

var str = "继续,取消   继续 ,取消";
var arr = str.split(/\s*,\s*/);
Values in arr = ["继续,取消   继续 ,取消"] // Size: 1, everything stays at index 0, no splitting happened

I also tried matching the characters by their Unicode escapes:

var str = "john,walker    james  , paul";
var arr = str.split(/\u0020*\u002C\u0020*/);
Values in arr = ["john", "walker    james", "paul"] // Size: 3

var str = "继续,取消   继续 ,取消";
var arr = str.split(/\u0020*\u002C\u0020*/);
Values in arr = ["继续,取消   继续 ,取消"] // Size: 1, everything stays at index 0, no splitting happened

I went through this link, but found little there that I could use in my scenario. Is it really impossible to write a regex that splits Chinese text?


Answer 1:


An ASCII comma won't match the comma used in your Chinese text, which is the fullwidth comma. Either replace the ASCII comma (\x2C) with the Chinese one (\uFF0C), or use a character class [,,] to match both:

var str = "继续,取消   继续 ,取消";
console.log(str.split(/\s*[,,]\s*/));
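When it is unclear which comma a string actually contains, printing the code points can help. The following is a minimal sketch, not part of the original answer: the helper name splitOnCommas and the sample string are made up for illustration, and the fullwidth comma is written as \uFF0C so it cannot be confused with the ASCII one.

```javascript
// The two commas look similar but are different characters.
const ascii = ",";          // U+002C COMMA
const fullwidth = "\uFF0C"; // U+FF0C FULLWIDTH COMMA

const asciiHex = ascii.codePointAt(0).toString(16);
const fullwidthHex = fullwidth.codePointAt(0).toString(16);
console.log(asciiHex, fullwidthHex); // 2c ff0c

// Hypothetical helper: split on either comma, swallowing surrounding whitespace.
function splitOnCommas(text) {
  return text.split(/\s*[,\uFF0C]\s*/);
}

const mixed = "john, walker\uFF0C继续 \uFF0C 取消";
const parts = splitOnCommas(mixed);
console.log(parts); // [ 'john', 'walker', '继续', '取消' ]
```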

Here is a regex that will match all the commas mentioned on the Comma Wikipedia page:

/\s*(?:\uD805\uDC4D|\uD836\uDE87|[\u002C\u02BB\u060C\u2E32\u2E34\u2E41\u2E49\u3001\uFE10\uFE11\uFE50\uFE51\uFF0C\uFF64\u00B7\u055D\u07F8\u1363\u1802\u1808\uA4FE\uA60D\uA6F5\u02BD\u0312\u0313\u0314\u0315\u0326\u201A])\s*/

Note that U+1144D (NEWA COMMA) and U+1DA87 (SIGNWRITING COMMA) have to be transpiled as \uD805\uDC4D and \uD836\uDE87 in order to be compatible with the ES5 regex standard.
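For engines that support the ES2015 u flag, the two astral commas can probably be written directly as code point escapes inside the character class instead of as surrogate-pair alternatives. The sketch below only illustrates that idea; the shortened character class and the sample string are assumptions for the example, not the full list from the regex above.

```javascript
// U+1144D is stored in a JavaScript string as the surrogate pair \uD805\uDC4D.
const newaComma = String.fromCodePoint(0x1144D);
const sameAsPair = newaComma === "\uD805\uDC4D";
console.log(sameAsPair); // true

// With the u flag, \u{...} escapes are allowed and a character class
// works on whole code points rather than on 16-bit code units.
// Only a few commas are listed here to keep the example short.
const commaRe = /\s*[\u002C\uFF0C\u{1144D}\u{1DA87}]\s*/u;

const sample = "a " + newaComma + " b\uFF0Cc";
const pieces = sample.split(commaRe);
console.log(pieces); // [ 'a', 'b', 'c' ]
```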

The commas handled are the code points enumerated in the pattern above.




Answer 2:


As of 2018, there is increasing support for new regex features in JavaScript, such as Unicode property escapes, so to match Chinese characters you can just do this:

const REGEX = /(\p{Script=Hani})+/gu;
'你好'.match(REGEX);
// ["你好"]

The trick is to use \p with the right script name, together with the u flag, which \p requires. Hani stands for the Han script (Chinese). The full list of scripts is here: http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt
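Applied to the string from the question, the same property escape can pull out the runs of Han characters directly. This is a sketch under the assumption that the engine supports \p{...} with the u flag. Note that it is not equivalent to splitting on commas: spaces and the fullwidth comma are not Han characters, so "取消   继续" comes back as two separate items.

```javascript
// Match each maximal run of Han-script characters.
const HAN_RUN = /\p{Script=Hani}+/gu;

// \uFF0C is the fullwidth comma from the question.
const text = "继续\uFF0C取消   继续 \uFF0C取消";
const words = text.match(HAN_RUN);
console.log(words); // [ '继续', '取消', '继续', '取消' ]
```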




Answer 3:


I did it and it works!

var str = "  继续  "; // sample input, assumed here for illustration
var re1 = new RegExp("^[\u4E00-\uFA29]*$"); // Chinese character range
var re2 = new RegExp("^[\uE7C7-\uE7F3]*$"); // Chinese character range
str = str.replace(/(^\s*)|(\s*$)/g, ''); // trim leading and trailing whitespace
if (re1.test(str) || re2.test(str)) {
  console.log('CHINESE CHAR');
}



Answer 4:


Just use vanilla JavaScript:

const str = "继续,取消   继续 ,取消";

// Replace every Chinese (fullwidth) comma with an ASCII comma, then split
const arr = str.replace(/,/g, `,`).split(`,`);


console.log(`result arr`, arr);
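One thing to keep in mind with this approach is that splitting on a bare comma leaves the surrounding whitespace in the items. A possible follow-up, sketched here with made-up variable names (input, rough, tidy), is to trim each item after splitting:

```javascript
// \uFF0C is the fullwidth comma; written as an escape to avoid ambiguity.
const input = "继续\uFF0C取消   继续 \uFF0C取消";

// Normalize to ASCII commas, then split. Whitespace next to the commas survives.
const rough = input.replace(/\uFF0C/g, ",").split(",");
console.log(rough); // [ '继续', '取消   继续 ', '取消' ]

// Trim each item and drop any empty ones.
const tidy = rough.map((s) => s.trim()).filter((s) => s.length > 0);
console.log(tidy); // [ '继续', '取消   继续', '取消' ]
```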


Source: https://stackoverflow.com/questions/44669073/regular-expression-to-match-and-split-on-chinese-comma-in-javascript
