Question

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?

iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true

Is [:alpha:] sometimes Unicode-aware and sometimes not?

EDIT:

I don't think my original example was clear enough.

Why is this false:

iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false

But this is true?

iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true

Answer 1:

When you pass the string to a regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語"), which prints 12 (all the bytes the input consists of: 230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158), with IO.puts String.length("汉语漢語"), which prints 4 (the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. The first expression therefore fails, because the anchored pattern ^[[:alpha:]]+$ requires every byte to match, while the second succeeds because it only needs a single matching byte anywhere in the string.
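The byte-versus-character difference can be inspected directly. The following is a minimal sketch; the module name ByteView and its summary/1 function are hypothetical helpers introduced here for illustration, not part of Elixir or of the original answer.

```elixir
defmodule ByteView do
  # Hypothetical helper (an assumption for illustration only).
  # Returns the raw UTF-8 bytes of a string alongside its byte count
  # and its length in Unicode graphemes.
  def summary(string) do
    %{
      bytes: :binary.bin_to_list(string),
      byte_size: byte_size(string),
      length: String.length(string)
    }
  end
end

# charlists: :as_lists keeps the byte list printed as integers.
IO.inspect(ByteView.summary("汉"), charlists: :as_lists)
IO.inspect(ByteView.summary("汉语漢語"), charlists: :as_lists)
```

For "汉" this reports three bytes but a length of one, which is why a byte-oriented pattern sees three separate units where the reader sees a single character.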

To match Unicode strings properly with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the u modifier:

IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

See the IDEONE demo (prints true)

See the Elixir Regex reference:

unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
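Since the u modifier also enables \p patterns, the same check can be written with the Unicode letter property instead of the POSIX class. This is only a sketch of that variant; the module name AlphaCheck and the function letters_only?/1 are hypothetical names chosen for this example.

```elixir
defmodule AlphaCheck do
  # Hypothetical helper (an assumption for illustration only).
  # True when the whole string consists of one or more Unicode letters.
  # \p{L} is the Unicode "Letter" property; it requires the u modifier.
  def letters_only?(string) do
    String.match?(string, ~r/^\p{L}+$/u)
  end
end

IO.puts(AlphaCheck.letters_only?("汉语漢語"))
IO.puts(AlphaCheck.letters_only?("汉语 123"))
```

The first call should report true and the second false, because the space and the digits are not letters.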

Source: https://stackoverflow.com/questions/33586468/unicode-and-alpha

Tags

regex

elixir

Unicode and :alpha:
