How to convert a string with character codes above 127 to a byte array properly?

Posted by 家住魔仙堡 on 2020-01-13 12:55:07

Question


I am retrieving ASCII strings encoded with code page 437 from another system which I need to transform to Unicode so they can be mixed with other Unicode strings.

This is what I am working with:

var asciiString = "\u0094"; // 0x94 represents 'ö' in code page 437.

var asciiEncoding = Encoding.GetEncoding(437);
var unicodeEncoding = Encoding.Unicode;

// This is what I attempted, but it does not seem to support the eighth bit. Characters using the eighth bit are replaced with '?' (0x3F).
var asciiBytes = asciiEncoding.GetBytes(asciiString);

// This work-around does the job, but there must be built in functionality to do this?
//var asciiBytes = asciiString.Select(c => (byte)c).ToArray();

// This piece of code happily converts the character correctly to Unicode: { 0x94 } => { 0xF6, 0x0 }.
var unicodeBytes = Encoding.Convert(asciiEncoding, unicodeEncoding, asciiBytes);
var unicodeString = unicodeEncoding.GetString(unicodeBytes); // I want this to be 'ö'.

What I am struggling with is that I cannot find a suitable method in the .NET Framework to transform a string with character codes above 127 into a byte array. This seems strange, since there is support for transforming a byte array with values above 127 into Unicode strings.

So my question is: is there a built-in method to do this conversion properly, or is my work-around the proper way to do it?


Answer 1:


var asciiString = "\u0094";

Whatever you name it, this will always be a Unicode string. .NET only has Unicode strings.

I am retrieving ASCII strings encoded with code page 437 from another system

Treat the incoming data as byte[], not as string.

var asciiBytes = new byte[] { 0x94 }; // 0x94 represents 'ö' in code page 437.

var asciiEncoding = Encoding.GetEncoding(437);    

var unicodeString = asciiEncoding.GetString(asciiBytes);
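If the data arrives as a stream or file rather than an in-memory array, the same idea applies: hand the code page 437 encoding to the reader so the bytes are decoded exactly once. The following is only a sketch. The `incoming` array is a made-up stand-in for the other system's output, and the `RegisterProvider` call assumes .NET Core or .NET 5+, where code page 437 is not available until the code pages provider is registered (on .NET Framework that line should not be needed).

```csharp
using System;
using System.IO;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where legacy code pages
// such as 437 must be registered before Encoding.GetEncoding can find them.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp437 = Encoding.GetEncoding(437);

// Hypothetical raw bytes from the other system: "sch", 0x94, "n".
byte[] incoming = { 0x73, 0x63, 0x68, 0x94, 0x6E };

// Decode while reading, so no intermediate string is ever built
// with the wrong encoding.
using var reader = new StreamReader(new MemoryStream(incoming), cp437);
string text = reader.ReadToEnd();

// Compare against the expected Unicode string ("schön").
Console.WriteLine(text == "sch\u00F6n"); // True
```

The resulting `text` is an ordinary .NET string and can be mixed freely with other strings.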



Answer 2:


\u0094 is Unicode code-point 0094, which is a control character; it is not ö. If you wanted ö, the correct string is

string s = "ö";

which is LATIN SMALL LETTER O WITH DIAERESIS, aka code-point 00F6.

So:

var s = "\u00F6"; // Identical to "ö"

Now we get our encoding:

var enc = Encoding.GetEncoding(437);
var bytes = enc.GetBytes(s);

And we find that it is a single byte, decimal 148, which is hex 94, i.e. what you were after.

The significance here is that in C# when you use the "\uXXXX" syntax, the XXXX is always referring to Unicode code-points, not the encoded value in some particular encoding.
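To make the distinction concrete, here is a small round-trip sketch: the Unicode code point U+00F6 encodes to the byte 0x94 in code page 437, and the byte 0x94 decodes back to U+00F6. The `RegisterProvider` line is an assumption for .NET Core / .NET 5+ and should not be required on .NET Framework.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 437 needs this registration.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp437 = Encoding.GetEncoding(437);

// "\u00F6" names the Unicode code point for 'ö', not a code page 437 byte value.
string s = "\u00F6";
byte[] encoded = cp437.GetBytes(s);
Console.WriteLine(BitConverter.ToString(encoded)); // 94

// Going the other way: the code page 437 byte 0x94 becomes U+00F6.
string decoded = cp437.GetString(new byte[] { 0x94 });
Console.WriteLine((int)decoded[0] == 0xF6); // True
```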




Answer 3:


You have to look earlier in the code. Once you have the data as a string, it has already been decoded. Any characters lost in that decoding are impossible to get back.

You need the input as bytes, so that you can use your encoding object for code page 437 to decode it into a string.

byte[] asciiData = new byte[] { 0x94 }; // character ö in codepage 437

Encoding asciiEncoding = Encoding.GetEncoding(437);

string unicodeString = asciiEncoding.GetString(asciiData);

Console.WriteLine(unicodeString);

Output:

ö
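If changing the upstream code is not an option, and the string was produced by widening each byte to the code unit with the same value (as the "\u0094" in the question suggests), no information has been lost yet and the bytes can still be recovered. A possible sketch, under that assumption only: ISO-8859-1 (code page 28591) maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF, so it acts as a built-in equivalent of the `(byte)c` work-around. The variable names are illustrative, and the `RegisterProvider` line assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 437 needs this registration.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical mis-decoded input: the byte 0x94 was widened to U+0094 upstream.
string misdecoded = "\u0094";

// ISO-8859-1 turns each char in U+0000..U+00FF back into the byte of the same value.
Encoding latin1 = Encoding.GetEncoding(28591);
byte[] rawBytes = latin1.GetBytes(misdecoded);

// Now decode the recovered bytes with the encoding they were really written in.
string repaired = Encoding.GetEncoding(437).GetString(rawBytes);

Console.WriteLine(BitConverter.ToString(rawBytes)); // 94
Console.WriteLine(repaired == "\u00F6");            // True
```

This only works when the earlier decoding preserved every byte value. If the upstream step already replaced characters with '?', the original bytes are gone and the input has to be obtained as bytes instead.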


Source: https://stackoverflow.com/questions/11952474/how-to-convert-a-string-with-character-codes-above-127-to-a-byte-array-properly
