C++ how to read from unicode files by ignoring first character of each line

寵の児 submitted on 2020-01-23 17:59:51

Question


Consider a file containing Unicode words as follows:

آب
آباد
آبادان

Reading right to left, the first character of each line is " آ ".

My first requirement is to read the file line by line. This would be simple.

The second requirement is to read each line starting from its second character. The result must be something like this:

ب
باد
بادان

As you know, there are functions like std::string::substr that could meet the second requirement, but AFAIK std::string::substr does not work well with Unicode characters.

I need something like this:

std::ifstream inFile(file_name);
//Solution for first requirement
std::string line;
if (!std::getline(inFile, line)) {
   std::cout << "failed to read file " << file_name << std::endl;
   inFile.close();
   break;
}
line.erase(line.find_last_not_of("\n\r") + 1);

std::string line2;
// What should be here to meet my second requirement?
// Stay on the current line,
// ignore the first character and read the rest of the line into line2.
line2.erase(line2.find_last_not_of("\n\r") + 1);

std::cout << "Line= " << line << std::endl;   // should print آب
std::cout << "Line2= " << line2 << std::endl; // should print ب

inFile.close();

Answer 1:


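C++11 has Unicode conversion routines, but they are not very user friendly (and std::wstring_convert and <codecvt> have been deprecated since C++17). You can wrap them in friendlier functions like this:

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// These helpers convert between UTF-8 and whatever the wide character
// encoding is for the platform (UTF-32/Linux - UCS-2/Windows)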
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

std::string remove_first_char(std::string const& utf8)
{
    std::wstring ws = utf8_to_ws(utf8);
    if(ws.empty())
        return {}; // substr(1) would throw std::out_of_range on an empty line
    ws = ws.substr(1);
    return ws_to_utf8(ws);
}

int main()
{
    std::string utf8 = u8"آبادان";

    std::cout << remove_first_char(utf8) << '\n';
}

Output:

بادان
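The function above works on a single string, so the file-reading half of the question still needs a loop over std::getline. The following is only a sketch under stated assumptions: the helper names (utf8_sequence_length, skip_first_code_point, read_lines_without_first_char) are hypothetical, the input is assumed to be UTF-8 without a byte order mark, and instead of the wide-string conversion used above it skips the first code point by looking at the UTF-8 lead byte. "First character" here means the first code point, which is not necessarily a full user-perceived character when combining marks are involved.

```cpp
#include <cstddef>
#include <istream>
#include <sstream> // std::istringstream, handy for trying the helpers without a file
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and for malformed lead bytes, so callers always advance.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // stray continuation or invalid byte
}

// Hypothetical helper: return `utf8` without its first code point.
std::string skip_first_code_point(std::string const& utf8)
{
    if (utf8.empty())
        return utf8;
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(utf8[0]));
    if (n > utf8.size())
        n = utf8.size(); // truncated sequence: drop whatever bytes are there
    return utf8.substr(n);
}

// Hypothetical helper: read all lines from `in` and strip the first code
// point from each one.
std::vector<std::string> read_lines_without_first_char(std::istream& in)
{
    std::vector<std::string> result;
    std::string line;
    while (std::getline(in, line)) {
        // std::getline already removes '\n'; drop a trailing '\r' from CRLF files.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        result.push_back(skip_first_code_point(line));
    }
    return result;
}
```

Because the reader takes a std::istream&, it can be called with a std::ifstream opened on the file from the question or with a std::istringstream when experimenting. Empty lines are returned as empty strings rather than treated as errors.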

Converting to a fixed-width code-point representation (UCS-2/UTF-32), as remove_first_char does, lets you process the string with the normal string functions. There is a caveat, though: UCS-2 does not cover all characters of all languages, so you may have to use std::u32string and write a conversion function between UTF-8 and UTF-32.

This answer has an example: https://stackoverflow.com/a/43302460/3807729
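For readers who want to see the std::u32string route without following the link, here is a minimal hand-written sketch. It is not the code from the linked answer; the function names (utf8_to_utf32, utf32_to_utf8, remove_first_code_point_utf32) are hypothetical, and the decoder is deliberately simple: it rejects invalid lead bytes, bad continuation bytes and truncated input, but it does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points (simplified validation).
std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode UTF-32 code points as UTF-8.
std::string utf32_to_utf8(std::u32string const& s)
{
    std::string out;
    for (char32_t cp : s) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp <= 0x10FFFF) {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            throw std::runtime_error("code point out of Unicode range");
        }
    }
    return out;
}

// Same idea as remove_first_char, but independent of the width of wchar_t.
std::string remove_first_code_point_utf32(std::string const& utf8)
{
    std::u32string s = utf8_to_utf32(utf8);
    if (s.empty())
        return std::string();
    return utf32_to_utf8(s.substr(1));
}
```

Since char32_t always holds a full code point, this version behaves the same on Windows and Linux, including for characters outside the Basic Multilingual Plane that UCS-2 cannot represent.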



Source: https://stackoverflow.com/questions/45565566/c-how-to-read-from-unicode-files-by-ignoring-first-character-of-each-line
