Is there a regular expression which matches a single grapheme cluster?

Submitted by 纵饮孤独 on 2021-02-16 06:37:46

Question


Graphemes are the user-perceived characters of a text, which in Unicode may consist of several code points.

From Unicode® Standard Annex #29:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

Is there a regex I can use (in JavaScript) which will match a single grapheme cluster? For example:

"한bar".match(/*?*/)[0] === "한"
"நிbaz".match(/*?*/)[0] === "நி"
"aa".match(/*?*/)[0] === "a"
"\r\n".match(/*?*/)[0] === "\r\n"
"💆‍♂️foo".match(/*?*/)[0] === "💆‍♂️"

Answer 1:


Full, easy-to-use integrated support: no. Approximations for various matching tasks: yes. From regex tutorial:

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X. You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.

In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.

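\X is the closest, but it does not exist in any version of JavaScript through ES6. \P{M}\p{M}*+ approximates \X, but JavaScript does not accept it in that form, since it has no possessive quantifiers. If your engine supports Unicode property escapes, natively or via transpilation (the u flag is ES6, but \p{…} itself was only standardized in ES2018), you can use /(\P{Mark})(\p{Mark}+)/gu.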
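To make the limits of that approximation concrete, here is a minimal sketch. It assumes an engine with Unicode property escapes, and it uses \p{Mark}* rather than \p{Mark}+ so that a base character with no combining marks still matches (my adjustment, not the answer's exact pattern). The helper name firstApproxGrapheme is hypothetical.

```javascript
// One non-mark code point followed by zero or more combining marks.
// This is only an approximation of a grapheme cluster.
const approxGrapheme = /^\P{Mark}\p{Mark}*/u;

// Hypothetical helper: returns the leading approximate grapheme, or null.
function firstApproxGrapheme(text) {
  const m = text.match(approxGrapheme);
  return m === null ? null : m[0];
}

// Works for base + combining mark sequences:
console.log(firstApproxGrapheme("aa") === "a");              // true
console.log(firstApproxGrapheme("e\u0301x") === "e\u0301");  // true

// Falls short of the question's requirements:
// CR LF is split, because "\n" is not a mark.
console.log(firstApproxGrapheme("\r\n") === "\r");           // true
// Emoji ZWJ sequences are cut at the joiner (U+200D is not a mark).
console.log(
  firstApproxGrapheme("\u{1F486}\u200D\u2642\uFE0Ffoo") === "\u{1F486}"
);                                                           // true
```

The last two cases show why a mark-based pattern cannot satisfy the "\r\n" and emoji examples from the question.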

But even still, that isn't sufficient. <== Read that link for all the gory details.

A proposal to segment text has been put forward, but it's not yet adopted. If you're dedicated to Chrome, you can use its non-standard Intl.v8BreakIterator to break clusters and match manually.
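For readers on newer runtimes: Intl.Segmenter, which is now part of ECMA-402 and ships in current browsers and Node.js, splits text into grapheme clusters. I am assuming this is the API the proposal mentioned above turned into, since the link did not survive here. The sketch below is not a regex, and the helper name firstGrapheme is hypothetical.

```javascript
// Assumes a runtime that provides Intl.Segmenter.
// Hypothetical helper: returns the first grapheme cluster, or null if empty.
function firstGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records.
  for (const { segment } of segmenter.segment(text)) {
    return segment;
  }
  return null;
}

console.log(firstGrapheme("aa") === "a");             // true
console.log(firstGrapheme("\r\n") === "\r\n");        // true
console.log(firstGrapheme("e\u0301x") === "e\u0301"); // true
```

Unlike the mark-based pattern, this keeps "\r\n" together, because the segmenter follows the grapheme cluster rules of UAX #29 rather than general categories.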



Source: https://stackoverflow.com/questions/53198407/is-there-a-regular-expression-which-matches-a-single-grapheme-cluster
