Extract the first letter of a UTF-8 string with Lua

广开言路 2020-12-06 13:06

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?

Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will ...

2 Answers
  • 2020-12-06 13:14

    You can easily extract the first letter from a UTF-8 encoded string with the following code:

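    -- Returns the first UTF-8 encoded character of str as a string (nil if str is empty).
    function firstLetter(str)
      -- One ASCII byte, or a lead byte (194-244) plus its continuation bytes (128-191).
      return str:match("[%z\1-\127\194-\244][\128-\191]*")
    end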
    

    This works because a UTF-8 encoded character is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one to three continuation bytes from 128 to 191.

    You can even iterate over UTF-8 code points in a similar manner:

    for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
      print(code)
    end
    

    Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
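
    If the numeric code point is needed instead, the bytes of each matched character can be decoded by hand. The following is only a minimal sketch, assuming well-formed UTF-8 input; the names CHAR_PATTERN and codepoint are hypothetical and not part of the answer above. It should run on Lua 5.1 and later.

    ```lua
    -- Pattern from the answer above: one ASCII byte, or a lead byte
    -- followed by its continuation bytes.
    local CHAR_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"

    -- Hypothetical helper: decode one UTF-8 encoded character (a string of
    -- 1 to 4 bytes) into its numeric code point. Assumes well-formed input.
    local function codepoint(ch)
      local lead = ch:byte(1)
      if lead < 128 then
        return lead                 -- single-byte (ASCII) character
      end
      local value
      if lead >= 240 then
        value = lead - 240          -- 4-byte sequence: 11110xxx
      elseif lead >= 224 then
        value = lead - 224          -- 3-byte sequence: 1110xxxx
      else
        value = lead - 192          -- 2-byte sequence: 110xxxxx
      end
      -- Each continuation byte (10xxxxxx) contributes 6 more bits.
      for i = 2, #ch do
        value = value * 64 + (ch:byte(i) - 128)
      end
      return value
    end

    local str = "\195\134\195\152\195\133"  -- "ÆØÅ" written as explicit UTF-8 bytes
    for ch in str:gmatch(CHAR_PATTERN) do
      print(ch, codepoint(ch))
    end
    ```

    For "ÆØÅ" the decoded values are 198, 216 and 197. The string is written with decimal escapes so that the example does not depend on the encoding of the source file.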

  • 2020-12-06 13:39

    Lua 5.3 provides a built-in UTF-8 library (utf8).

    You can use utf8.codes to get each code point, and then use utf8.char to get the character:

    local str = "ÆØÅ"
    for _, c in utf8.codes(str) do
      print(utf8.char(c))
    end
    

    This also works:

    local str = "ÆØÅ"
    for w in str:gmatch(utf8.charpattern) do
      print(w)
    end
    

    Here utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" (its value in Lua 5.3), a pattern that matches exactly one UTF-8 byte sequence, assuming the subject is a valid UTF-8 string.
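
    To answer the original question directly, the same library can return just the first letter. This is a rough sketch that assumes Lua 5.3 or later and valid UTF-8 input; the helper name firstLetter is only illustrative.

    ```lua
    -- Hypothetical helper; requires the utf8 library (Lua 5.3 or later).
    local function firstLetter(str)
      -- utf8.offset(str, 2) is the byte position where the second character
      -- starts; it is nil when str is empty.
      local second = utf8.offset(str, 2)
      if not second then
        return str
      end
      return str:sub(1, second - 1)
    end

    local s = "\195\134\195\152\195\133"  -- "ÆØÅ" written as explicit UTF-8 bytes
    print(firstLetter(s))                 -- first character, as a string
    print(s:match(utf8.charpattern))      -- alternative: first match of the pattern
    print(utf8.codepoint(s, 1))           -- numeric code point of the first character
    ```

    The utf8.offset variant avoids running a pattern match, while s:match(utf8.charpattern) is shorter; for valid input both should give the same string.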
