Question
Usually we can get a string from a byte[] using something like
var result = Encoding.UTF8.GetString(bytes);
However, my input is an IEnumerable<byte[]> bytes (the implementation can be any structure of my choice). A character is not guaranteed to lie within a single byte[]: for example, a 2-byte UTF-8 character can have its first byte in bytes[1][length - 1] and its second byte in bytes[2][0].
Is there any way to decode them without merging or copying all the arrays together? UTF-8 is the main focus, but it would be better if other encodings could be supported too. If there is no other solution, I think implementing my own UTF-8 reader would be the way.
I plan to stream them using a MemoryStream, however Encoding cannot work on Stream, just byte[]. If the arrays were merged, the resulting array could be very large (the List<byte[]> already holds up to 4 GB).
I am using .NET Standard 2.0. I wish I could use 2.1 (it is not released yet) with Span<byte[]>, which would be perfect for my case!
Answer 1:
The Encoding class can't deal with that directly, but the Decoder returned from Encoding.GetDecoder() can (indeed, that's its entire reason for existing). StreamReader uses a Decoder internally.
It's slightly fiddly to work with though, as it needs to populate a char[], rather than returning a string (Encoding.GetString() and StreamReader normally handle the business of populating the char[]).
The problem with using a MemoryStream is that you're copying all of the bytes from one array to another, for no gain. If all of your buffers are the same length, you can do this:
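var decoder = Encoding.UTF8.GetDecoder();
// bufferSize is the common length of every byte[] in bytes.
// GetMaxCharCount is a method of Encoding (not Decoder); the +1 is spare room
// for a work-in-progress char carried over from the previous buffer.
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize) + 1];
foreach (var byteSegment in bytes)
{
    // The decoder keeps an incomplete trailing sequence and finishes it on the next call.
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}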
If the buffers have different lengths:
var decoder = Encoding.UTF8.GetDecoder();
char[] chars = Array.Empty<char>();
foreach (var byteSegment in bytes)
{
    // Size the char buffer for this segment's length; the +1 is spare room
    // for a work-in-progress char carried over from the previous buffer.
    int charsMinSize = Encoding.UTF8.GetMaxCharCount(byteSegment.Length) + 1;
    if (chars.Length < charsMinSize)
        chars = new char[charsMinSize];
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}
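As a self-contained illustration of the same idea, here is a minimal sketch that wraps the Decoder in an iterator and also flushes it at the end, so an incomplete final sequence is not silently dropped. The names SegmentedDecoding and DecodeSegments are hypothetical, and the sketch asks the decoder for the exact char count of each segment instead of using a worst-case estimate; it assumes the results are consumed chunk by chunk rather than joined into one huge string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SegmentedDecoding
{
    // Hypothetical helper: decodes the segments as one continuous text,
    // yielding one string per segment without merging the byte arrays.
    public static IEnumerable<string> DecodeSegments(IEnumerable<byte[]> segments, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = Array.Empty<char>();

        foreach (byte[] segment in segments)
        {
            // Exact number of chars this call will produce, taking into account
            // any bytes the decoder is still holding from the previous segment.
            int needed = decoder.GetCharCount(segment, 0, segment.Length, false);
            if (chars.Length < needed)
                chars = new char[needed];

            // flush: false keeps an incomplete trailing sequence for the next call.
            int written = decoder.GetChars(segment, 0, segment.Length, chars, 0, false);
            if (written > 0)
                yield return new string(chars, 0, written);
        }

        // Flush once at the end so leftover bytes go through the decoder's fallback.
        byte[] none = Array.Empty<byte>();
        int tailCount = decoder.GetCharCount(none, 0, 0, true);
        if (tailCount > 0)
        {
            char[] tail = new char[tailCount];
            int tailWritten = decoder.GetChars(none, 0, 0, tail, 0, true);
            yield return new string(tail, 0, tailWritten);
        }
    }
}
```

For example, passing the two arrays { 0x68, 0xC3 } and { 0xA9, 0x6C, 0x6C, 0x6F } with Encoding.UTF8 splits the two-byte character U+00E9 across the boundary; the decoder holds 0xC3 back after the first array and completes the character when the second one arrives.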
Answer 2:
however Encoding cannot work on Stream, just byte[].
Correct, but a StreamReader (which is a TextReader) can be attached to a Stream.
So just create that MemoryStream, push bytes in at one end and call ReadLine() at the other. I must say I have never tried that.
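If copying everything into a MemoryStream is the concern, one possible variation on this idea is a small read-only Stream that serves the existing arrays in order, so StreamReader pulls from them directly through its own fixed-size buffer. This is only a sketch: SegmentStream and SegmentStreamDemo are hypothetical names, not part of .NET, and the stream deliberately supports forward reading only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical read-only stream that exposes a sequence of byte arrays
// as one continuous, forward-only stream.
public sealed class SegmentStream : Stream
{
    private readonly IEnumerator<byte[]> _segments;
    private byte[] _current = Array.Empty<byte>();
    private int _offset;

    public SegmentStream(IEnumerable<byte[]> segments)
    {
        _segments = segments.GetEnumerator();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Move past exhausted (or empty) segments; return 0 only at the real end.
        while (_offset == _current.Length)
        {
            if (!_segments.MoveNext())
                return 0;
            _current = _segments.Current;
            _offset = 0;
        }

        int n = Math.Min(count, _current.Length - _offset);
        Buffer.BlockCopy(_current, _offset, buffer, offset, n);
        _offset += n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            _segments.Dispose();
        base.Dispose(disposing);
    }
}

public static class SegmentStreamDemo
{
    // Reads everything into one string for demonstration only; for very large
    // inputs, ReadLine() or Read(char[], int, int) in a loop is the better fit.
    public static string ReadAllText(IEnumerable<byte[]> segments)
    {
        using (var reader = new StreamReader(new SegmentStream(segments), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Because StreamReader keeps its own Decoder across reads, a character split between two arrays is still decoded as one character.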
Answer 3:
Working code based on Henk's answer using StreamReader:
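using (var memoryStream = new MemoryStream())
{
    using (var reader = new StreamReader(memoryStream))
    {
        foreach (var byteSegment in bytes)
        {
            // Empty the stream first; otherwise a segment shorter than the
            // previous one would leave stale bytes after the new data.
            memoryStream.SetLength(0);
            await memoryStream.WriteAsync(byteSegment, 0, byteSegment.Length);
            // Rewind so the reader sees the segment that was just written.
            memoryStream.Seek(0, SeekOrigin.Begin);
            Debug.WriteLine(await reader.ReadToEndAsync());
        }
    }
}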
Source: https://stackoverflow.com/questions/54970472/can-the-encoding-api-decode-a-stream-noncontinuous-bytes