How to include ё in [а-я] regexp char interval

Submitted by ぐ巨炮叔叔 on 2019-12-01 02:10:14

This is cool - I had never thought that much about character ranges in unicode.

It seems that for some reason А-я were encoded in the Unicode range 0x410 to 0x44f, but some other characters (such as ё) were added in 0x400 to 0x40f and then 0x450 to 0x45f (Wikipedia has a full breakdown of what characters went where).

As a consequence, /[Ѐ-ё]/ should work, but might feel quite illogical to a native speaker.
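The layout described above can be checked directly from Ruby. This is only a quick sketch: the variable names are made up for illustration, and it assumes a UTF-8 source file, which is the default in current Ruby versions.

```ruby
# Print the code point of each letter to see where it sits in the Cyrillic block.
letters = %w[Ѐ Ё А Я а я ё]
code_points = letters.map { |ch| [ch, format("U+%04X", ch.ord)] }.to_h
code_points.each { |ch, cp| puts "#{ch}  #{cp}" }

# Ё sits before А and ё sits after я, so neither falls inside the А-я run.
basic_run = ("А".ord.."я".ord)
yo_inside_basic_run = basic_run.cover?("Ё".ord) || basic_run.cover?("ё".ord)
puts yo_inside_basic_run

# The wider Ѐ-ё range spans both letters (and some non-Russian ones in between).
wide_range = /\A[Ѐ-ё]+\z/
puts "Ёё".match?(wide_range)
```

The wider range is convenient, but it also accepts letters such as Ђ that are not part of the Russian alphabet.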

You can of course use raw Unicode escapes, i.e. /[\u0400-\u045f]/ (or up to \u04ff if you want the full Cyrillic block), but then you either have to remember the numbers or assign the pattern to a constant for future use.

Lastly, you can refer to entire scripts with

/\p{Cyrillic}/

although my understanding is that this includes more characters, such as Ԧ
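To make the difference between these options concrete, here is a small comparison. It is a sketch rather than a recommendation; the constant-like names and the sample strings are chosen just for this example.

```ruby
# Three ways to describe "Cyrillic", from narrowest to widest.
basic_block = /\A[\u0400-\u045f]+\z/   # the range used above
full_block  = /\A[\u0400-\u04ff]+\z/   # the whole Cyrillic block
script      = /\A\p{Cyrillic}+\z/      # the whole Cyrillic script

samples = ["верёвочка", "Ԧ", "rope"]
results = samples.map do |s|
  [s, { basic: s.match?(basic_block), full: s.match?(full_block), script: s.match?(script) }]
end.to_h

results.each { |s, r| puts "#{s}: #{r}" }
```

Ԧ lives outside the U+0400 to U+04FF block, so only the script property picks it up.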

Sergio Belevskij

There is one solution, though not a beautiful one: use /[а-ё]/ instead of /[а-яё]/. This works, but the letters are not in their proper alphabetical order:

str = "верёвочка"
str[/^[а-ё]+$/]
#=> "верёвочка"

The original /а-яА-Я/ and /а-яА-ЯёЁ/ patterns just match sequences of literal chars, а-яА-Я and а-яА-ЯёЁ strings respectively, since the char ranges are not enclosed with [ and ] that would form a character class. Even if they were, without a quantifier, that would only match a single char that falls within the range(s).
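The difference between the three forms is easy to see side by side. The snippet below is only an illustration, with variable names picked for the example:

```ruby
literal = /а-яА-Я/      # no brackets: matches the literal text "а-яА-Я"
single  = /[а-яА-Я]/    # character class, no quantifier: a single letter
run     = /[а-яА-Я]+/   # character class with +: a run of letters

word = "Привет"
p word[literal]       # nil
p "а-яА-Я"[literal]   # "а-яА-Я"
p word[single]        # "П"
p word[run]           # "Привет"
```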

To match a sequence of one or more Russian letters, you need either of:

/[а-яА-ЯёЁ]+/
/[а-яё]+/i

See the Rubular demo

Note that there is NO Unicode category class like \p{Russian}, and \p{Cyrillic} matches all Cyrillic chars, not just the Russian ones. The letter Ёё does not fall into the ranges а-я and А-Я and must be added "manually": in the Unicode table, А-я occupy U+0410 to U+044F, while Ё is U+0401 and ё is U+0451.

And here is the Ruby demo:

s = "Верёвочка - 12"
puts s[/[а-яА-ЯёЁ]+/] # => Верёвочка
puts s[/[а-яё]+/i]    # => Верёвочка
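
For comparison, this is what happens when ё is left out of the class. It is a small sketch with a made-up sample string; it relies on the /i flag also covering the uppercase Ё, as in the demo above.

```ruby
without_yo = /[а-я]+/i
with_yo    = /[а-яё]+/i

s = "Верёвочка и Ёлка"
p s.scan(without_yo)  # => ["Вер", "вочка", "и", "лка"]
p s.scan(with_yo)     # => ["Верёвочка", "и", "Ёлка"]
```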