问题
I had a problem with opening UTF-8 path files. Path that has a UTF-8 char (like Cyrillic or Latin). I found a way to solve that with _wfopen
but the way a solved it was when I encode the UTF-8 char with UTF by hand (\Uxxxx).
Is there a function, macro or anything that when I supply the string (path) it will return the Unicode??
Something like this: https://www.branah.com/unicode-converter
I tried with MultiByteToWideChar
but it returns some Hex numbers that are not relavent.
Tried:
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed boost, and now I am trying to do it with boost. Can some one maybe help me out with boost.
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
回答1:
Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::string s = "test";
std::cout << std::hex << std::setfill('0');
std::cout << "Input `char` data: ";
for (char c : s) {
std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
}
std::cout << '\n';
std::wstring ws = convert.from_bytes(s);
std::cout << "Output `wchar_t` data: ";
for (wchar_t wc : ws) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx
and \uxxxx
actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107
, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input
char
data: c4 87
Outputwchar_t
data: 0107
This is in contrast to if you use test strings such as:
std::string s = "ć";
Or
std::string s = "\u0107";
Which may result in the following output:
Input
char
data: 3f
Outputwchar_t
data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
So I have a path: wchar_t path[100] = _T("čaćšžđ\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen
call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char
string and converting that to wchar_t
and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt"
actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?
, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
wchar_t path[16] = L"čaćšžđ\\test.txt";
std::cout << std::hex << std::setfill('0');
for (wchar_t wc : path) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
for (wchar_t wc : s) {
std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
}
std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx
form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx
, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
Also, you're using _T()
incorrectly. IMO You shouldn't be using TCHAR
and the related macros at all, but if you do use it then you ought to use it consistently: don't mix TCHAR
APIs with explicit use of the *W APIs or wchar_t
. The whole point of TCHAR
is to allow code to be independent and switch between those wchar_t
and Microsoft's "ANSI" APIs, so using TCHAR
and then hard coding an assumption that TCHAR
is wchar_t
defeats the entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
回答2:
Your code is Windows-specific, and you're using Visual C++. So, just use wide literals. Visual C++ supports wide strings for file stream constructors.
It's as simple as that ‐ when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
wchar_t const path[] = L"cacšžd/test.txt";
ifstream f( path );
int ch;
while( (ch = f.get()) != EOF )
{
cout.put( ch );
}
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have Boost file system library adopted into the standard library, and then for conformance g++ will ideally offer the constructor used here.
回答3:
The problem was that I was saving the CPP file as ANSI... I had to convert it to UTF-8. I tried this before posting but VS 2015 turns it into ANSI, I had to change it in VS so I could get it working.
I tried opening the cpp file with notepad++ and changing the encoding but when I turn on VS it automatically returns. So I was looking to Save As
option but there is no encoding option. Finally i found it, in Visual Studio 2015
File -> Advanced Save Options in the Encoding dropdown change it to Unicode
One thing that is still strange to me, how did VS display the characters normally but when I opened the file in N++ there was ? (like it was supposed to be, because of ANSI)?
来源:https://stackoverflow.com/questions/35329149/coding-a-path-in-unicode-c