The following code shows unexpected behaviour on my machine (tested with Visual C++ 2008 SP1 on Windows XP and VS 2012 on Windows 7):
#include
It's time to close this now. Stephan T. Lavavej says the behaviour is "by design", although I cannot follow this explanation.
My current knowledge is: Windows XP console in UTF-8 codepage does not work with C++ iostreams.
Windows XP is getting out of fashion now and so does VS 2008. I'd be interested to hear if the problem still exists on newer Windows systems.
On Windows 7 the effect is probably due to the way the C++ streams output characters. As seen in an answer to Properly print utf8 characters in windows console, UTF-8 output fails with C stdio when printing one byte after after another like putc('\xc3'); putc('\xbc');
as well. Perhaps this is what C++ streams do here.
I understand the question is quite old, but if someone would still be interested, below is my solution. I've implemented a quite simple std::streambuf descendant and then passed it to each of standard streams on the very beginning of program execution.
This allows you to use UTF-8 everywhere in your program. On input, data is taken from console in Unicode and then converted and returned to you in UTF-8. On output the opposite is done, taking data from you in UTF-8, converting it to Unicode and sending to console. No issues found so far.
Also note, that this solution doesn't require any codepage modification, with either SetConsoleCP
, SetConsoleOutputCP
or chcp
, or something else.
That's the stream buffer:
class ConsoleStreamBufWin32 : public std::streambuf
{
public:
ConsoleStreamBufWin32(DWORD handleId, bool isInput);
protected:
// std::basic_streambuf
virtual std::streambuf* setbuf(char_type* s, std::streamsize n);
virtual int sync();
virtual int_type underflow();
virtual int_type overflow(int_type c = traits_type::eof());
private:
HANDLE const m_handle;
bool const m_isInput;
std::string m_buffer;
};
ConsoleStreamBufWin32::ConsoleStreamBufWin32(DWORD handleId, bool isInput) :
m_handle(::GetStdHandle(handleId)),
m_isInput(isInput),
m_buffer()
{
if (m_isInput)
{
setg(0, 0, 0);
}
}
std::streambuf* ConsoleStreamBufWin32::setbuf(char_type* /*s*/, std::streamsize /*n*/)
{
return 0;
}
int ConsoleStreamBufWin32::sync()
{
if (m_isInput)
{
::FlushConsoleInputBuffer(m_handle);
setg(0, 0, 0);
}
else
{
if (m_buffer.empty())
{
return 0;
}
std::wstring const wideBuffer = utf8_to_wstring(m_buffer);
DWORD writtenSize;
::WriteConsoleW(m_handle, wideBuffer.c_str(), wideBuffer.size(), &writtenSize, NULL);
}
m_buffer.clear();
return 0;
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::underflow()
{
if (!m_isInput)
{
return traits_type::eof();
}
if (gptr() >= egptr())
{
wchar_t wideBuffer[128];
DWORD readSize;
if (!::ReadConsoleW(m_handle, wideBuffer, ARRAYSIZE(wideBuffer) - 1, &readSize, NULL))
{
return traits_type::eof();
}
wideBuffer[readSize] = L'\0';
m_buffer = wstring_to_utf8(wideBuffer);
setg(&m_buffer[0], &m_buffer[0], &m_buffer[0] + m_buffer.size());
if (gptr() >= egptr())
{
return traits_type::eof();
}
}
return sgetc();
}
ConsoleStreamBufWin32::int_type ConsoleStreamBufWin32::overflow(int_type c)
{
if (m_isInput)
{
return traits_type::eof();
}
m_buffer += traits_type::to_char_type(c);
return traits_type::not_eof(c);
}
The usage then is as follows:
template<typename StreamT>
inline void FixStdStream(DWORD handleId, bool isInput, StreamT& stream)
{
if (::GetFileType(::GetStdHandle(handleId)) == FILE_TYPE_CHAR)
{
stream.rdbuf(new ConsoleStreamBufWin32(handleId, isInput));
}
}
// ...
int main()
{
FixStdStream(STD_INPUT_HANDLE, true, std::cin);
FixStdStream(STD_OUTPUT_HANDLE, false, std::cout);
FixStdStream(STD_ERROR_HANDLE, false, std::cerr);
// ...
std::cout << "\xc3\xbc" << std::endl;
// ...
}
Left out wstring_to_utf8
and utf8_to_wstring
could easily be implemented with WideCharToMultiByte
and MultiByteToWideChar
WinAPI functions.
Oi. Congratulations on finding a way to change the code page of the console from inside your program. I didn't know about that call, I always had to use chcp.
I'm guessing the C++ default locale is getting involved. By default it will use the code page provide by GetThreadLocale() to determine the text encoding of non-wstring stuff. This generally defaults to CP1252. You could try using SetThreadLocale() to get to UTF-8 (if it even does that, can't recall), with the hope that std::locale defaults to something that can handle your UTF-8 encoding.