Code to strip diacritical marks using ICU

醉话见心 2020-12-18 08:29

Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., cha

2 Answers
  •  余生分开走
    2020-12-18 09:05

    ICU lets you transliterate a string using a specific rule. My rule is "NFD; [:M:] Remove; NFC": decompose each character (NFD), remove every combining mark ([:M:] is the Unicode Mark category), then recompose (NFC). The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:

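    #include <memory>
    #include <stdexcept>
    #include <string>
    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <unicode/utypes.h>
    
    std::string desaxUTF8(const std::string& str) {
        // UTF-8 std::string -> UTF-16 UnicodeString
        icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));
    
        // Build the transliterator; unique_ptr frees it when the function returns
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Transliterator> accentsConverter(
            icu::Transliterator::createInstance(
                "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
        if (U_FAILURE(status)) {
            // One possible way to handle the error: report ICU's error name
            throw std::runtime_error(u_errorName(status));
        }
    
        // Transliterate the UTF-16 UnicodeString in place
        accentsConverter->transliterate(source);
    
        // UTF-16 UnicodeString -> UTF-8 std::string
        std::string result;
        source.toUTF8String(result);
    
        return result;
    }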
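
To make the middle step of that rule concrete, here is a minimal standard-library sketch (not ICU code) of what removing marks does to text that is already decomposed. The helper name stripCombiningMarks is hypothetical. It assumes valid UTF-8 input that is already in NFD form, and it only handles the Combining Diacritical Marks block (U+0300 to U+036F), whereas [:M:] in ICU covers every Unicode mark.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of ICU. Assumes `nfd` is valid UTF-8 that is
// already decomposed (NFD). Drops only code points U+0300..U+036F.
std::string stripCombiningMarks(const std::string& nfd) {
    std::string out;
    out.reserve(nfd.size());
    for (std::size_t i = 0; i < nfd.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(nfd[i]);
        if ((lead == 0xCC || lead == 0xCD) && i + 1 < nfd.size()) {
            const unsigned char trail = static_cast<unsigned char>(nfd[i + 1]);
            // U+0300..U+033F encode as CC 80 .. CC BF,
            // U+0340..U+036F encode as CD 80 .. CD AF.
            const bool isMark =
                (lead == 0xCC && trail >= 0x80 && trail <= 0xBF) ||
                (lead == 0xCD && trail >= 0x80 && trail <= 0xAF);
            if (isMark) {
                ++i;  // skip the trail byte together with the lead byte
                continue;
            }
        }
        out.push_back(nfd[i]);
    }
    return out;
}
```

Given "cafe\xCC\x81" (an e followed by U+0301), this returns "cafe". A precomposed é (bytes C3 A9) passes through unchanged, which is why the ICU rule runs NFD before the Remove step.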
    
