Question
Consider a file containing Unicode words such as the following:
آب
آباد
آبادان
Reading right to left, the first character of each word is "آ".
My first requirement is to read the file line by line. That part is simple.
My second requirement is to read each line starting from its second character. The result should look like this:
ب
باد
بادان
There are functions such as std::string::substr that could meet the second requirement, but as far as I know std::string::substr does not work well with Unicode characters, because it counts bytes rather than characters.
I need something like this:
std::ifstream inFile(file_name);
// Solution for the first requirement
std::string line;
if (!std::getline(inFile, line)) {
    std::cout << "failed to read file " << file_name << std::endl;
    inFile.close();
    break;
}
line.erase(line.find_last_not_of("\n\r") + 1);
std::string line2;
// What should go here to meet my second requirement?
// Stay on the current line,
// ignore the first character, and std::getline(inFile, line2)
line2.erase(line2.find_last_not_of("\n\r") + 1);
std::cout << "Line= " << line << std::endl;   // should print آب
std::cout << "Line2= " << line2 << std::endl; // should print ب
inFile.close();
Answer 1:
C++11 has Unicode conversion routines, but they are not very user-friendly. You can wrap them in friendlier functions like this:
#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// These convert between UTF-8 and the platform's wide-character
// encoding (UTF-32 on Linux, UCS-2 on Windows).
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

std::string remove_first_char(std::string const& utf8)
{
    std::wstring ws = utf8_to_ws(utf8);
    if(ws.empty())
        return utf8; // nothing to remove; substr(1) would throw here
    ws = ws.substr(1);
    return ws_to_utf8(ws);
}

int main()
{
    std::string utf8 = u8"آبادان";
    std::cout << remove_first_char(utf8) << '\n';
}
Output:
بادان
By converting to a fixed-width encoding with one element per code point (UCS-2/UTF-32), you can process the string using the normal string functions. There is a caveat, though: UCS-2 does not cover all characters of all languages, so you may have to use std::u32string and write a conversion function between UTF-8 and UTF-32.
This answer has an example: https://stackoverflow.com/a/43302460/3807729
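As a rough illustration of the std::u32string route, the sketch below uses the same std::wstring_convert facility with char32_t in place of wchar_t, so each element holds a full code point on every platform. The function names are hypothetical and not taken from the linked answer. Be aware that std::wstring_convert and <codecvt> are deprecated since C++17, and some older MSVC releases are reported to have trouble linking the char32_t specialization, so treat this as an assumption to verify on your toolchain.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-32.
// from_bytes throws std::range_error on malformed input.
std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    return cnv.from_bytes(utf8);
}

// Hypothetical helper: UTF-32 -> UTF-8.
std::string utf32_to_utf8(std::u32string const& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    return cnv.to_bytes(s);
}

// Removes the first code point, including ones outside the
// Basic Multilingual Plane that UCS-2 cannot represent.
std::string remove_first_char_utf32(std::string const& utf8)
{
    std::u32string s = utf8_to_utf32(utf8);
    if (s.empty())
        return utf8; // nothing to remove
    return utf32_to_utf8(s.substr(1));
}
```

Calling remove_first_char_utf32 on the UTF-8 bytes of آبادان should give بادان, the same result as the wchar_t version above.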
Source: https://stackoverflow.com/questions/45565566/c-how-to-read-from-unicode-files-by-ignoring-first-character-of-each-line