I'm writing a cross-platform application in C++. All strings are UTF-8-encoded internally. Consider the following simplified code:
#include <string>
#include <iostream>

int main() {
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;
}

On Windows, the console output is garbled instead of showing the non-ASCII characters. What is going wrong?
std::cout is doing exactly what it should: it's sending your UTF-8 encoded text along to the console, but your console will interpret those bytes using its current code page. You need to set your program's console to the UTF-8 code page:
#include <string>
#include <iostream>
#include <Windows.h>
int main() {
    SetConsoleOutputCP(CP_UTF8);
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;
}
It would be great if Windows switched the default code page to UTF-8, but Microsoft likely can't do that due to backwards-compatibility concerns.
The problem is not std::cout but the Windows console. Using C stdio, you will get the ü with fputs( "\xc3\xbc", stdout ); after setting the UTF-8 code page (either using SetConsoleOutputCP or chcp) and selecting a Unicode-capable font in cmd's settings (Consolas should support over 2000 characters, and there are registry hacks to add more capable fonts to cmd).
If you output the bytes one at a time with putc('\xc3', stdout); putc('\xbc', stdout); you will get two tofu glyphs instead, because the console receives and interprets each byte separately as an illegal character. This is probably what the C++ streams do.
See UTF-8 output on Windows console for a lengthy discussion.
For my own project, I finally implemented a std::stringbuf that converts to Windows-1252. If you really need full Unicode output, this will not really help you, however.
An alternative approach would be overwriting cout's streambuf, using fputs for the actual output:
#include <iostream>
#include <sstream>
#include <Windows.h>
class MBuf : public std::stringbuf {
public:
    int sync() {
        fputs( str().c_str(), stdout );
        str( "" );
        return 0;
    }
};

int main() {
    SetConsoleOutputCP( CP_UTF8 );
    setvbuf( stdout, nullptr, _IONBF, 0 );
    MBuf buf;
    std::cout.rdbuf( &buf );
    std::cout << u8"Greek: αβγδ\n" << std::flush;
}
I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.
Set the console output encoding to UTF-8 using the following Windows API call:
SetConsoleOutputCP(65001);
Documentation for that function is available on Windows Dev Center.
Forget everything you know about the Windows console and its Unicode/UTF-8 support (or rather lack of support). This is 2020 and it's a new world. This is not a direct answer to the question above, but rather an alternative that makes much more sense now, a new way that was not possible before.
Everybody's right, the root problem is the Windows console. But there's a new player in town, and it's Windows Terminal. Install and launch Windows Terminal. Use this program:
#include <iostream>
#include <windows.h>
int main()
{
    SetConsoleOutputCP(CP_UTF8);
    // or have your user set the console code page: `chcp 65001`
    std::cout << "\"u\" with two dots on top: \xc3\xbc\n";
    std::cout << "chinese glyph meaning \"value\": \xe5\x80\xbc\n";
    std::cout << "smiling emoji: \xf0\x9f\x98\x80\n";
    return 0;
}
This program sends UTF-8 through a plain cout.
The output (screenshot omitted): all three characters render correctly in Windows Terminal.
The command chcp 65001 or SetConsoleOutputCP(CP_UTF8) is required in a cmd tab of Windows Terminal, but apparently not in a PowerShell tab. Maybe PowerShell is UTF-8 by default?
Windows Terminal roots out the core issue, the legacy console, and in my opinion it is now the best option. Spread the word.
I had the same problem and wrote a very small library called libpu8 for this: https://github.com/jofeu/libpu8
For windows consoles, it replaces the streambufs of cin, cout and cerr so that they accept and produce utf-8 at the front end and talk to the console in UTF-16. On non-windows operating systems, or if cin, cout, cerr are attached to files/pipes and not consoles, it does nothing. It also translates the arguments of the C++ main() function to UTF-8 on windows.
Usage Example:
#include <libpu8.h>
#include <string>
#include <fstream>
#include <iostream>
#include <windows.h>
// argv are utf-8 strings when you use main_utf8 instead of main.
// main_utf8 is a macro. On Windows, it expands to a wmain that calls
// main_utf8 with converted strings.
int main_utf8(int argc, char** argv)
{
    // this will also work on a non-Windows OS that supports utf-8 natively
    std::ofstream f(u8widen(argv[1]));
    if (!f)
    {
        // On Windows, use the "W" functions of the windows-api together
        // with u8widen and u8narrow
        MessageBoxW(0,
            u8widen(std::string("Failed to open file ") + argv[1]).c_str(), 0, 0);
        return 1;
    }
    std::string line;
    // line will be utf-8 encoded regardless of whether cin is attached to a
    // console, or a utf-8 file or pipe.
    std::getline(std::cin, line);
    // line will be displayed correctly on a console, and will be utf-8 if
    // cout is attached to a file or pipe.
    std::cout << "You said: " << line;
    return 0;
}
At last, I've got it working. This answer combines input from Miles Budnek, Paul, and mkluwe with some research of my own. First, let me start with code that will work on Windows 10. After that, I'll walk you through the code and explain why it won't work out of the box on Windows 7.
#include <string>
#include <iostream>
#include <Windows.h>
#include <cstdio>
int main() {
    // Set console code page to UTF-8 so the console knows how to interpret string data
    SetConsoleOutputCP(CP_UTF8);

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    setvbuf(stdout, nullptr, _IOFBF, 1000);

    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test << std::endl;
}
The code starts by setting the code page, as suggested by Miles Budnek. This will tell the console to interpret the byte stream it receives as UTF-8, not as some variation of ANSI.
Next, there is a problem in the STL code that comes with Visual Studio. std::cout prints its data to a stream buffer of type std::basic_filebuf. When that buffer receives a string (via std::basic_streambuf::sputn()), it won't pass it on to the underlying file as a whole. Instead, it will pass each byte separately. As explained by mkluwe, if the console receives a UTF-8 byte sequence as individual bytes, it won't interpret them as a single code point. Instead, it will treat them as multiple characters. Each byte within a UTF-8 byte sequence is an invalid code point on its own, so you'll see �'s instead. There is a related bug report for Visual Studio, but it was closed as By Design. The workaround is to enable buffering for the stream. As an added bonus, that will give you better performance. However, you may now need to regularly flush the stream as I do with std::endl, or your output may not show.
Lastly, the Windows console supports both raster fonts and TrueType fonts. As pointed out by Paul, raster fonts will simply ignore the console's code page. So non-ASCII Unicode characters will only work if the console is set to a TrueType Font. Up until Windows 7, the default is a raster font, so the user will have to change it manually. Luckily, Windows 10 changes the default font to Consolas, so this part of the problem should solve itself with time.