Matching Unicode letters with RegExp

Posted by 别来无恙 on 2021-02-04 07:00:07

Question


I need to match Unicode letters, similar to PCRE's \p{L}.

Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.

I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.

So, I want to match letters like:

foobar
מכון ראות

But the R symbol shouldn't be matched:

BlackBerry®

Neither should ASCII control characters, punctuation marks, and so on. Essentially, every letter in every script Unicode supports should match, whether it's å, ä, φ or ת, as long as it is an actual letter.


Answer 1:


I know this is an old question, but RegExp now supports Unicode property escapes (since Dart 2.4) when the unicode flag is set, so you can do something like this:

void main() {
  // unicode: true is required for \p{...} property escapes.
  final alpha = RegExp(r'\p{Letter}', unicode: true);
  print(alpha.hasMatch("f")); // true
  print(alpha.hasMatch("ת")); // true
  print(alpha.hasMatch("®")); // false
}
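Building on the same property escape, the examples from the question (whole words, and "BlackBerry®" without the ®) can be handled by matching runs of letters rather than single characters. This is only a minimal sketch: letterRuns and isAllLetters are hypothetical helper names, and it assumes a null-safe Dart SDK (2.12 or later). Combining marks are not in \p{L}, so text with decomposed accents or vowel points would be split at those marks.

```dart
// Sketch only: helper names are illustrative, not part of any Dart library.
final RegExp _letters = RegExp(r'\p{L}+', unicode: true);
final RegExp _onlyLetters = RegExp(r'^\p{L}+$', unicode: true);

/// Returns every maximal run of Unicode letters found in [input].
List<String> letterRuns(String input) =>
    _letters.allMatches(input).map((m) => m.group(0)!).toList();

/// True when [input] is non-empty and consists solely of Unicode letters.
bool isAllLetters(String input) => _onlyLetters.hasMatch(input);
```

With these helpers, letterRuns('BlackBerry®') yields a single run, 'BlackBerry', and isAllLetters('foo bar') is false because of the space.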



Answer 2:


I don't think the Dart libraries contain complete information about which Unicode characters are letters and which are not. You might be able to put together something that mostly works using the intl package, particularly its Bidi class. I'm thinking that, for example,

import 'package:intl/intl.dart'; // provides Bidi
bool isLetter(String oneCharacterString) => Bidi.endsWithLtr(oneCharacterString) || Bidi.endsWithRtl(oneCharacterString);

might do a plausible job; the Bidi implementation at least seems to contain a number of ranges of valid characters. Alternatively, you could put together your own RegExp based on the ranges in _LTR_CHARS and _RTL_CHARS. The Bidi source explicitly says it's not 100% accurate, but good for most practical purposes.




Answer 3:


Looks like you're going to have to iterate through the runes in the string and check each integer value against a table of Unicode ranges.
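A rough sketch of that approach is below. The table is a hand-picked, deliberately tiny subset (basic Latin, Latin-1, Greek and Hebrew letters), chosen only to illustrate the lookup. The names letterRanges, isLetterRune and containsOnlyLetters are hypothetical, and the check assumes precomposed characters rather than base letters followed by combining marks.

```dart
// Illustrative subset only: a real table would be generated from the
// Unicode data files and cover every letter range.
const List<List<int>> letterRanges = [
  [0x0041, 0x005A], // A-Z
  [0x0061, 0x007A], // a-z
  [0x00C0, 0x00D6], // À-Ö
  [0x00D8, 0x00F6], // Ø-ö (skips the multiplication sign at U+00D7)
  [0x00F8, 0x00FF], // ø-ÿ (skips the division sign at U+00F7)
  [0x0391, 0x03A1], // Greek capitals Α-Ρ
  [0x03A3, 0x03A9], // Greek capitals Σ-Ω
  [0x03B1, 0x03C9], // Greek small α-ω
  [0x05D0, 0x05EA], // Hebrew א-ת
];

/// True when [rune] falls inside one of the ranges in the table above.
bool isLetterRune(int rune) =>
    letterRanges.any((r) => rune >= r[0] && rune <= r[1]);

/// True when [s] is non-empty and every rune is in the table.
bool containsOnlyLetters(String s) =>
    s.isNotEmpty && s.runes.every(isLetterRune);
```

A linear scan is fine for a table this small; a full table with hundreds of ranges would more likely be sorted and searched with a binary search.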

Go has some code to generate these tables directly from the Unicode source data. See maketables.go, and some of the other files in the Go unicode package.

Or take the lazy option: file a Dart bug and wait for the Dart team to implement it ;)




Answer 4:


There's no support for this yet in Dart or JS.

The XRegExp JS library can generate the fairly large character-class regexps needed for something like this. You may be able to generate the regexp, print it, and copy and paste it into your app.



Source: https://stackoverflow.com/questions/15531928/matching-unicode-letters-with-regexp
