If I want to use C++11\'s regular expressions with unicode strings, will they work with char* as UTF-8 or do I have to convert them to a wchar_t* string?
C++11 regular expressions will "work with" UTF-8 just fine, for a minimal definition of "work". If you want "complete" Unicode regular expression support for UTF-8 strings, you will be better off with a library that supports that directly such as http://www.pcre.org/ .
Yes they will, this is by design of the UTF-8 encoding. Substring operations should work correctly if the string is treated as an array of bytes rather than an array of codepoints.
See FAQ #18 here: http://www.utf8everywhere.org/#faq.validation about how this is achieved in this encoding's design.
You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale. The following test returned true for me on Clang/OS X.
bool test_unicode()
{
std::locale old;
std::locale::global(std::locale("en_US.UTF-8"));
std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
bool result = std::regex_match(std::string("abcdéfg"), pattern);
std::locale::global(old);
return result;
}
NOTE: This was compiled in a file what was UTF-8 encoded.
Just to be safe I also used a string with the explicit hex versions. It worked also.
bool test_unicode2()
{
std::locale old;
std::locale::global(std::locale("en_US.UTF-8"));
std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);
std::locale::global(old);
return result;
}
Update test_unicode()
still works for me
$ file regex-test.cpp
regex-test.cpp: UTF-8 Unicode c program text
$ g++ --version
Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
I have a use-case where I need to handle potentially unicode strings when looking for Cartesian coordinates, and this sample shows how I handle it as advised for std::wregex
and std::wstring
, against potentially unicode characters for a parsing module.
static bool isCoordinate(std::wstring token)
{
std::wregex re(L"^(-?[[:digit:]]+)$");
std::wsmatch match;
return std::regex_search(token, match, re);
}
int wmain(int argc, wchar_t * argv[])
{
// Testing against not a number nor unicode designation
bool coord = ::isCoordinate(L"أَبْجَدِيَّة عَرَبِيَّة中文");
if (!coord)
return 0;
return 1;
}