Do C++11 regular expressions work with UTF-8 strings?

一个人的身影 2020-12-01 07:57

If I want to use C++11's regular expressions with Unicode strings, will they work with char* as UTF-8, or do I have to convert them to a wchar_t* string?

4 answers
  • 2020-12-01 08:17

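    C++11 regular expressions will "work with" UTF-8 just fine, for a minimal definition of "work": std::regex on char strings matches byte by byte, so literal text matches correctly, but constructs such as . operate on single bytes rather than on code points. If you want "complete" Unicode regular expression support for UTF-8 strings, you will be better off with a library that supports that directly, such as http://www.pcre.org/ .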
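
To make the "minimal definition of work" concrete, here is a small sketch of byte-oriented matching with std::regex. The function names are made up for illustration, and the strings are written with explicit hex escapes so the example does not depend on the source file encoding. It assumes the default "C" locale.

```cpp
#include <regex>
#include <string>

// std::regex on char strings sees bytes, not code points.
// "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9 (é).

// True if the whole string is exactly one char (one byte) long, as seen by '.'.
bool matches_as_one_unit(const std::string& utf8)
{
    static const std::regex one_unit("^.$");
    return std::regex_match(utf8, one_unit);
}

// True if the whole string is exactly two chars (two bytes) long.
bool matches_as_two_units(const std::string& utf8)
{
    static const std::regex two_units("^..$");
    return std::regex_match(utf8, two_units);
}

// Literal text still matches, because the pattern bytes equal the subject bytes.
bool contains_cafe(const std::string& utf8)
{
    static const std::regex cafe("caf\xC3\xA9");
    return std::regex_search(utf8, cafe);
}
```

With these helpers, a lone é does not match a single `.` but does match two of them, while a literal search for "café" succeeds. Quantifiers have the same caveat: written after a multi-byte character, they apply only to its last byte.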

  • 2020-12-01 08:18

    Yes, they will; this follows from the design of the UTF-8 encoding. Substring operations work correctly when the string is treated as an array of bytes rather than an array of code points.

    See FAQ #18 here: http://www.utf8everywhere.org/#faq.validation about how this is achieved in this encoding's design.
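
The property this answer relies on can be sketched in a few lines: in UTF-8, continuation bytes always look like 10xxxxxx and lead bytes never do, so a byte-wise search for a valid UTF-8 needle inside valid UTF-8 text cannot match in the middle of a character. The helper names below are hypothetical, and the code assumes both strings are valid UTF-8 and the needle is not empty.

```cpp
#include <cstddef>
#include <string>

// Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.
bool is_continuation_byte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Byte-wise substring search. Returns true if the needle is found and the
// match begins on a character boundary (not on a continuation byte).
// Assumes haystack and needle are both valid UTF-8.
bool found_on_boundary(const std::string& haystack, const std::string& needle)
{
    const std::size_t pos = haystack.find(needle);
    if (pos == std::string::npos)
        return false;
    return !is_continuation_byte(static_cast<unsigned char>(haystack[pos]));
}
```

For example, searching "naïve café" for "é" with plain std::string::find lands on the lead byte 0xC3 of the é, never on a byte that belongs to the ï.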

  • 2020-12-01 08:20

    You would need to test your compiler and the system you are using, but in theory, it will be supported if your system has a UTF-8 locale. The following test returned true for me on Clang/OS X.

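    #include <locale>
    #include <regex>
    #include <string>

    bool test_unicode()
    {
        // Save the current global locale, then switch to a UTF-8 locale.
        std::locale old;
        std::locale::global(std::locale("en_US.UTF-8"));
    
        // The regex picks up the global locale when it is constructed.
        std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
        bool result = std::regex_match(std::string("abcdéfg"), pattern);
    
        // Restore the previous global locale.
        std::locale::global(old);
    
        return result;
    }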
    

    NOTE: This was compiled in a file that was UTF-8 encoded.


    Just to be safe, I also tried a string with the bytes written as explicit hex escapes. That worked too.

    bool test_unicode2()
    {
        std::locale old;
        std::locale::global(std::locale("en_US.UTF-8"));
    
        std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
        // "\xC3\xA9" is é in UTF-8; the literal is split so "f" is not read as a hex digit.
        bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);
    
        std::locale::global(old);
    
        return result;
    }
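
Since the tests above depend on the system providing a UTF-8 locale, it can help to probe for it first: constructing std::locale from a name the system does not know throws std::runtime_error. A minimal sketch, where locale_available is a hypothetical helper name and not part of the standard library:

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: reports whether a named locale can be constructed
// on this system. std::locale's constructor throws std::runtime_error
// when the name is not recognized.
bool locale_available(const char* name)
{
    try {
        std::locale probe(name);
        (void)probe; // only the construction matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling locale_available("en_US.UTF-8") before running test_unicode() distinguishes "the locale is missing" from "the regex did not match".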
    

    Update: test_unicode() still works for me:

    $ file regex-test.cpp 
    regex-test.cpp: UTF-8 Unicode c program text
    
    $ g++ --version
    Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
    Apple LLVM version 8.0.0 (clang-800.0.42.1)
    Target: x86_64-apple-darwin15.6.0
    Thread model: posix
    InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
    
  • 2020-12-01 08:31

    My use case is a parsing module that looks for Cartesian coordinates in tokens that may contain Unicode characters. Following the advice to use std::wregex and std::wstring, this sample shows how I handle it.

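    #include <regex>
    #include <string>

    // Returns true if the token is an optionally negative integer.
    static bool isCoordinate(const std::wstring& token)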
    {   
        std::wregex re(L"^(-?[[:digit:]]+)$");
        std::wsmatch match;
        return std::regex_search(token, match, re);
    }
    
    int wmain(int argc, wchar_t * argv[])
    {
        // Test with a non-numeric Unicode token (Arabic and Chinese text).
        bool coord = ::isCoordinate(L"أَبْجَدِيَّة عَرَبِيَّة‎中文"); 
    
        if (!coord)
            return 0;
        return 1;
    }
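
If the input arrives as a UTF-8 char string, as in the question, it has to be converted before std::wregex can see whole characters. The following is only a sketch under stated assumptions: decode_utf8 and is_single_character are hypothetical helpers, the input is assumed to be valid UTF-8 (no validation is done), and wchar_t is assumed wide enough to hold each code point, which does not hold on Windows for code points above U+FFFF.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: decodes UTF-8 into one wchar_t per code point.
// Assumes valid UTF-8 input and a wchar_t that can hold every code point.
std::wstring decode_utf8(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; } // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; } // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; } // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; } // 11110xxx
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// After decoding, '.' in a std::wregex matches one whole character.
bool is_single_character(const std::string& utf8)
{
    static const std::wregex one_char(L"^.$");
    return std::regex_match(decode_utf8(utf8), one_char);
}
```

With this conversion, the two-byte sequence for é decodes to a single wchar_t, so a pattern like `.` treats it as one character instead of two bytes.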
    