Can the Encoding API decode a Stream/noncontinuous bytes?

Posted by 送分小仙女 on 2021-02-05 11:10:21

Question


Usually we can get a string from a byte[] using something like:

var result = Encoding.UTF8.GetString(bytes);

However, I have this problem: my input is an IEnumerable<byte[]> bytes (the implementation can be any structure of my choice). A character is not guaranteed to lie within a single byte[] (for example, a 2-byte UTF-8 character can have its 1st byte in bytes[1][length - 1] and its 2nd byte in bytes[2][0]).

Is there any way to decode them without merging/copying all the arrays together? UTF-8 is the main focus, but it would be better if other encodings could be supported. If there is no other solution, I think implementing my own UTF-8 reader would be the way.

I plan to stream them using a MemoryStream, however Encoding cannot work on Stream, just byte[]. If merged together, the resulting array could be very large (the List<byte[]> already holds up to 4 GB).

I am using .NET Standard 2.0. I wish I could use 2.1 (it is not released yet), since Span<byte[]> would be perfect for my case!


Answer 1:


The Encoding class can't deal with that directly, but the Decoder returned from Encoding.GetDecoder() can (indeed, that's its entire reason for existing). StreamReader uses a Decoder internally.

It's slightly fiddly to work with though, as it needs to populate a char[], rather than returning a string (Encoding.GetString() and StreamReader normally handle the business of populating the char[]).

The problem with using a MemoryStream is that you're copying all of the bytes from one array to another, for no gain. If all of your buffers are the same length, you can do this:

var decoder = Encoding.UTF8.GetDecoder();
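// bufferSize is the common length of every byte[].
// GetMaxCharCount is a method on Encoding, not on Decoder.
// +1 in case it includes a work-in-progress char from the previous buffer
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize) + 1];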
foreach (var byteSegment in bytes)
{
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}

If the buffers have different lengths:

var decoder = Encoding.UTF8.GetDecoder();
char[] chars = Array.Empty<char>();
foreach (var byteSegment in bytes)
{
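    // Size from this segment's own length (GetMaxCharCount lives on Encoding, not Decoder).
    // +1 in case it includes a work-in-progress char from the previous buffer
    int charsMinSize = Encoding.UTF8.GetMaxCharCount(byteSegment.Length) + 1;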
    if (chars.Length < charsMinSize)
        chars = new char[charsMinSize];
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}
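
As a self-contained illustration of the Decoder approach above, here is a minimal sketch that writes the decoded text to a TextWriter instead of Debug output and flushes the decoder at the end, so a truncated trailing sequence is not silently dropped. The names ChunkedDecoder and Decode are hypothetical helpers made up for this example, not part of the framework; only Encoding, Decoder and TextWriter calls from the base class library are used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not a framework type.
public static class ChunkedDecoder
{
    // Decodes each chunk in order and writes the chars to output.
    // The Decoder keeps incomplete trailing bytes between calls, so a
    // character split across two chunks is decoded once its last byte arrives.
    public static void Decode(IEnumerable<byte[]> chunks, Encoding encoding, TextWriter output)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = Array.Empty<char>();

        foreach (byte[] chunk in chunks)
        {
            // Worst-case char count for this many bytes.
            int needed = encoding.GetMaxCharCount(chunk.Length) + 1;
            if (chars.Length < needed)
                chars = new char[needed];

            // flush: false keeps any partial character for the next chunk.
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, false);
            output.Write(chars, 0, written);
        }

        // Final flush: emits whatever the decoder produces for leftover bytes
        // (with the default UTF-8 fallback, a replacement character).
        byte[] none = Array.Empty<byte>();
        int tail = decoder.GetCharCount(none, 0, 0, true);
        if (chars.Length < tail)
            chars = new char[tail];
        int tailWritten = decoder.GetChars(none, 0, 0, chars, 0, true);
        output.Write(chars, 0, tailWritten);
    }
}
```

For example, passing the two chunks { 0x68, 0xC3 } and { 0xA9, 0x21 } with Encoding.UTF8 and a StringWriter yields "hé!", even though the two bytes of "é" sit in different arrays. Writing to a TextWriter rather than building one string avoids holding the whole decoded text in memory when the input is very large.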



Answer 2:


"however Encoding cannot work on Stream, just byte[]."

Correct, but a StreamReader (which is a TextReader) can be attached to a Stream.

So just create that MemoryStream, push bytes in on one end and use ReadLine() on the other. I must say I have never tried that.
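
One way to follow this idea without first copying everything into a MemoryStream is a small read-only Stream that walks the byte[] chunks in order, which a StreamReader can then consume. The sketch below is only an illustration under that assumption: ChunkStream is a hypothetical class written for this example, not a framework type. StreamReader still copies bytes into its own small internal buffer, but the chunks are never merged into one large array.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical read-only, forward-only stream over a sequence of byte arrays.
public sealed class ChunkStream : Stream
{
    private readonly IEnumerator<byte[]> _chunks;
    private byte[] _current = Array.Empty<byte>();
    private int _offset;

    public ChunkStream(IEnumerable<byte[]> chunks)
    {
        _chunks = chunks.GetEnumerator();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Skip exhausted or empty chunks; return 0 only at the real end.
        while (_offset == _current.Length)
        {
            if (!_chunks.MoveNext())
                return 0;
            _current = _chunks.Current;
            _offset = 0;
        }

        // Copy at most the remainder of the current chunk.
        int n = Math.Min(count, _current.Length - _offset);
        Buffer.BlockCopy(_current, _offset, buffer, offset, n);
        _offset += n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing)
            _chunks.Dispose();
        base.Dispose(disposing);
    }
}
```

With this in place, new StreamReader(new ChunkStream(bytes), Encoding.UTF8) can be read with ReadLine() or Read(char[], int, int) as usual; the reader's internal Decoder handles characters that straddle chunk boundaries.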




Answer 3:


Working code based on Henk's answer using StreamReader:

    using (var memoryStream = new MemoryStream())
    {
        using (var reader = new StreamReader(memoryStream))
        {
            foreach (var byteSegment in bytes)
            {
                // Truncate first; otherwise bytes from a longer earlier segment stay in the stream.
                memoryStream.SetLength(0);
                await memoryStream.WriteAsync(byteSegment, 0, byteSegment.Length);
                memoryStream.Seek(0, SeekOrigin.Begin);

                Debug.WriteLine(await reader.ReadToEndAsync());
            }
        }
    }


Source: https://stackoverflow.com/questions/54970472/can-the-encoding-api-decode-a-stream-noncontinuous-bytes
