Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return a single byte of the first letter rather than "Ø".
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 encoded code point is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one or more continuation bytes from 128 to 191.
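To make those byte ranges concrete, here is a small sketch that prints each byte of "ÆØÅ" together with the range it falls in. The classify function is a hypothetical helper written for this illustration, and the string is spelled with decimal escapes so the example does not depend on how the source file is encoded.

```lua
-- "ÆØÅ" as raw UTF-8 bytes (each letter takes two bytes).
local str = "\195\134\195\152\195\133"

-- Hypothetical helper: name the range a byte value falls in.
function classify(b)
  if b <= 127 then
    return "single"        -- a complete one-byte character
  elseif b <= 191 then
    return "continuation"  -- 128..191, never starts a character
  elseif b >= 194 and b <= 244 then
    return "lead"          -- starts a multi-byte character
  else
    return "invalid"       -- 192, 193 and 245..255 never occur in valid UTF-8
  end
end

for i = 1, #str do
  local b = str:byte(i)
  print(i, b, classify(b))
end
```

Each letter shows up as one lead byte followed by one continuation byte, which is why string.sub with byte indices cuts letters in half.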
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
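If you do need the numeric value, one possible sketch is to decode the matched bytes by hand. The codepoint function below is a hypothetical helper, not part of Lua; it assumes its argument is a single well-formed UTF-8 sequence (as produced by the pattern above) and uses plain arithmetic so that it does not rely on any bit library.

```lua
-- Hypothetical helper: decode one well-formed UTF-8 character into its
-- numeric code point. No validation is performed.
function codepoint(char)
  local lead = char:byte(1)
  if lead < 128 then
    return lead                -- one-byte character: the byte is the code point
  end
  local cp
  if lead >= 240 then
    cp = lead - 240            -- four-byte sequence, lead byte 11110xxx
  elseif lead >= 224 then
    cp = lead - 224            -- three-byte sequence, lead byte 1110xxxx
  else
    cp = lead - 192            -- two-byte sequence, lead byte 110xxxxx
  end
  -- Each continuation byte (10xxxxxx) contributes six more bits.
  for i = 2, #char do
    cp = cp * 64 + (char:byte(i) - 128)
  end
  return cp
end

local str = "\195\134\195\152\195\133"  -- "ÆØÅ" as UTF-8 bytes
for char in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(char, codepoint(char))
end
```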
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
  print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
  print(w)
end
where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches exactly one UTF-8 byte sequence.
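For the original question (just the first letter), a sketch using the same library might look like the following. It requires Lua 5.3 or later; firstLetterUtf8 is a hypothetical name chosen here, while utf8.offset, utf8.codepoint and utf8.len are standard library functions.

```lua
-- Hypothetical helper: return the first UTF-8 character of str,
-- or nil if str is empty. Requires the utf8 library (Lua 5.3+).
function firstLetterUtf8(str)
  -- utf8.offset(str, 2) gives the byte position where the second
  -- character starts (one past the end if there is only one character),
  -- and nil when there is no first character to step over.
  local nextStart = utf8.offset(str, 2)
  if not nextStart then
    return nil
  end
  return str:sub(1, nextStart - 1)
end

local str = "\195\134\195\152\195\133"  -- "ÆØÅ" as UTF-8 bytes
print(firstLetterUtf8(str))    -- the first letter as a string
print(utf8.codepoint(str, 1))  -- its numeric code point
print(utf8.len(str))           -- number of characters, not bytes
```

Unlike the pattern-based version, this only scans as far as the start of the second character, and utf8.codepoint gives the numeric value directly when that is what you need.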