Unicode and :alpha:

回眸只為那壹抹淺笑 submitted on 2019-12-30 08:32:24

Question


Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?:

iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true

Sometimes [:alpha:] is unicode and sometimes it's not?

EDIT:

I don't think my original example was clear enough.

Why is this false:

iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false

But this is true?:

iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true

Answer 1:


When you pass the string to a regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語"), which prints 12 (the input consists of the bytes 230,177,137,232,175,173,230,188,162,232,170,158), with IO.puts String.length("汉语漢語"), which prints 4 (the Unicode characters). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. Thus the first expression, which requires every byte from start to end to match, fails, while the second succeeds because it needs only a single matching byte.
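The byte-versus-character difference can be checked directly. The following is a minimal sketch using the single-character string from the question's edit; the variable names are illustrative only, and the per-byte results are printed rather than stated, since they depend on the character tables of the regex engine in use.

```elixir
s = "汉"

# One Unicode character, encoded as three UTF-8 bytes.
bytes = :binary.bin_to_list(s)
IO.inspect(bytes)              # [230, 177, 137]
IO.inspect(byte_size(s))       # 3
IO.inspect(String.length(s))   # 1

# Without the u modifier the regex sees individual bytes.
# Test each byte on its own against [[:alpha:]] to see which ones match.
per_byte = Enum.map(bytes, fn b -> {b, Regex.match?(~r/[[:alpha:]]/, <<b>>)} end)
IO.inspect(per_byte)

# The two patterns from the question, still without the u modifier.
# The question reports true for the unanchored one and false for the anchored one.
IO.inspect(String.match?(s, ~r/[[:alpha:]]/))
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/))
```

If any byte in per_byte comes back false, the anchored pattern cannot match the whole string, whereas the unanchored pattern needs only one byte that comes back true.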

To match Unicode strings properly with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:

IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

This prints true.

See the Elixir Regex reference:

unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
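As a small sketch of what that reference text describes, the u modifier also makes \p{L} (the Unicode "letter" property) and \w usable on this input. The variable names below are illustrative and not from the original answer.

```elixir
text = "汉语漢語"

# \p{L} matches any Unicode letter once the u modifier is set.
letters_only = String.match?(text, ~r/^\p{L}+$/u)
IO.inspect(letters_only)   # true

# With u, \w also covers Unicode letters, not just ASCII ones.
word_chars = String.match?(text, ~r/^\w+$/u)
IO.inspect(word_chars)     # true

# A space is not a letter, so the anchored pattern rejects this string.
with_space = String.match?("汉语 漢語", ~r/^\p{L}+$/u)
IO.inspect(with_space)     # false
```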



Source: https://stackoverflow.com/questions/33586468/unicode-and-alpha
