Is there a standard technique for packing binary data into a UTF-16 string?

Submitted by 百般思念 on 2019-11-29 06:26:22

I stumbled onto Base16k after reading your question. Not strictly a standard but it seems to work well and was easy enough to implement in C#.

Dave Van den Eynde

May I suggest you use Base64? It may not be the most efficient option storage-wise, but it does have its benefits:

  1. Your worries about the code are over.
  2. You'll have the least compatibility problems with other players, if there are any.
  3. Should the encoded string ever be considered as ASCII during conversion, export, import, backup, restore, whatever, you won't have any problems either.
  4. Should you ever drop dead or end up under a bus or something, any programmer who ever gets their hands on the comment field will instantly know that it's base64 and not assume it's all encrypted or something.
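In .NET the Base64 route needs no extra code at all, since `System.Convert` already provides both directions. A minimal round trip looks like this:

```csharp
using System;

class Base64Demo
{
    static void Main()
    {
        byte[] data = { 0xDE, 0xAD, 0xBE, 0xEF };

        // Encode: every 3 input bytes become 4 ASCII characters.
        string encoded = Convert.ToBase64String(data);
        Console.WriteLine(encoded); // "3q2+7w=="

        // Decode back to the original bytes.
        byte[] decoded = Convert.FromBase64String(encoded);
        Console.WriteLine(BitConverter.ToString(decoded)); // "DE-AD-BE-EF"
    }
}
```

Because the output is plain ASCII, it survives any of the export/import/backup scenarios listed above.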

Firstly, remember that Unicode doesn't mean 16 bits. The fact that System.String uses UTF-16 internally is neither here nor there. Unicode characters are abstract - they only gain bit representations through encodings.

You say "my storage is a System.String" - if that's the case, you cannot talk about bits and bytes, only Unicode characters. System.String certainly has its own internal encoding, but (in theory) that could be different.

Incidentally, if you believe that the internal representation of System.String is too memory-inefficient for Base64-encoded data, why aren't you also worrying about Latin/Western strings?

If you want to store binary data in a System.String, you need a mapping between collections of bits and characters.

Option A: There's a pre-made one in the shape of Base64 encoding. As you've pointed out, this encodes six bits of data per character.

Option B: If you want to pack more bits per character, then you'll need to make an array (or encoding) of 128, 256, 512, etc. Unicode characters, and pack 7, 8, 9, etc. bits of data per character. Those characters need to be real Unicode characters.
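As a minimal sketch of Option B, here is an 8-bits-per-character mapping. The choice of base offset is an assumption: U+4E00 starts the CJK Unified Ideographs block, which contains well over 256 consecutive assigned characters, so every byte value maps to a real Unicode character.

```csharp
using System;
using System.Linq;
using System.Text;

static class ByteMap
{
    // Assumed offset: the CJK Unified Ideographs block starting at U+4E00
    // has more than 256 consecutive assigned characters, so each byte
    // value 0..255 lands on a real character.
    const int Offset = 0x4E00;

    public static string Encode(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
            sb.Append((char)(Offset + b)); // one character per byte
        return sb.ToString();
    }

    // Assumes the input was produced by Encode; no validation is done.
    public static byte[] Decode(string s)
    {
        return s.Select(c => (byte)(c - Offset)).ToArray();
    }
}
```

That packs 8 bits per character against Base64's 6; Base16k (below) goes further and packs 14.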

To answer your question simply: yes, there is a standard, and it's Base64 encoding.

Is this a real problem? Do you have perf data to support your idea of not using Base64?
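The rough numbers behind that question (ignoring per-object overhead): Base64 yields 4 characters per 3 input bytes, and each character in a System.String occupies 2 bytes of UTF-16, so the in-memory cost is 8/3 ≈ 2.67× the raw size. Base16k at 4 characters per 7 bytes costs only 8/7 ≈ 1.14×:

```csharp
using System;

class SizeCheck
{
    static void Main()
    {
        byte[] raw = new byte[300];

        // Base64: 4 output characters for every 3 input bytes.
        string b64 = Convert.ToBase64String(raw);
        Console.WriteLine(b64.Length);     // 400 characters
        Console.WriteLine(b64.Length * 2); // ~800 bytes of UTF-16 storage

        // Base16k needs about 4 chars per 7 bytes: roughly 172 characters,
        // i.e. ~344 bytes -- barely more than the raw data itself.
        Console.WriteLine((raw.Length * 4 + 6) / 7);
    }
}
```

Whether that ~2.3× difference matters depends entirely on how much data ends up in these strings.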

Peter S. Housel

You could treat the binary data as UTF-8b. The UTF-8b encoding assumes that the bytes are UTF-8 multibyte sequences, but has a fallback encoding for things that are not.
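A heavily simplified sketch of the fallback half of that idea (the surrogate-escape trick): bytes that can't be decoded are mapped to lone low surrogates U+DC80..U+DCFF. For brevity this version treats every byte >= 0x80 as invalid instead of first attempting a real UTF-8 multibyte decode, which a full UTF-8b codec would do.

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8bSketch
{
    public static string Encode(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            if (b < 0x80)
                sb.Append((char)b);            // ASCII passes through
            else
                sb.Append((char)(0xDC00 + b)); // escape as a lone low surrogate
        }
        return sb.ToString();
    }

    public static byte[] Decode(string s)
    {
        return s.Select(c => c >= 0xDC80 && c <= 0xDCFF
            ? (byte)(c - 0xDC00)               // unescape the raw byte
            : (byte)c).ToArray();
    }
}
```

One caveat: strings containing lone surrogates are valid System.String values, but they are not well-formed UTF-16 for interchange, so some APIs and serializers may reject or mangle them.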

Here's a C# version of Jim Beveridge's C++ implementation:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;


//
// Base16k.cs : Variant of Base64 used to efficiently encode binary into Unicode UTF-16 strings. Based on work by
// Markus Scherer at https://sites.google.com/site/markusicu/unicode/base16k
//
// This code is hereby placed in the Public Domain.
// Jim Beveridge, November 29, 2011.
//
// C# port of http://qualapps.blogspot.com/2011/11/base64-for-unicode-utf16.html
// This code is hereby placed in the Public Domain.
// J. Daniel Smith, February 23, 2015
//

namespace JDanielSmith
{
    public static partial class Convert
    {
        /// <summary>
        /// Encode a binary array into a Base16k string for Unicode.
        /// </summary>
        public static string ToBase16kString(byte[] inArray)
        {
            int len = inArray.Length;

            var sb = new StringBuilder(len*6/5);
            sb.Append(len);

            int code = 0;

            for (int i=0; i<len; ++i)
            {
                byte byteValue = inArray[i];
                switch (i%7)
                {
                case 0:
                    code = byteValue<<6;
                    break;

                case 1:
                    code |= byteValue>>2;
                    code += 0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code = (byteValue&3)<<12;
                    break;

                case 2:
                    code |= byteValue<<4;
                    break;

                case 3:
                    code |= byteValue>>4;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code = (byteValue&0xf)<<10;
                    break;

                case 4:
                    code |= byteValue<<2;
                    break;

                case 5:
                    code|=byteValue>>6;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code=(byteValue&0x3f)<<8;
                    break;

                case 6:
                    code|=byteValue;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code=0;
                    break;
                }
            }

            // emit a character for remaining bits
            if (len%7 != 0) {
                code += 0x5000;
                sb.Append(System.Convert.ToChar(code));
            }

            return sb.ToString();
        }

        /// <summary>
        ///  Decode a Base16k string for Unicode into a binary array.
        /// </summary>
        public static byte[] FromBase16kString(string s)
        {
            // read the length
            var r = new Regex(@"^\d+", RegexOptions.None, matchTimeout: TimeSpan.FromMilliseconds(100));
            Match m = r.Match(s);
            if (!m.Success)
                return null;

            int length;
            if (!Int32.TryParse(m.Value, out length))
                return null;

            var buf = new List<byte>(length);

            int pos=0;  // position in s
            while ((pos < s.Length) && (s[pos] >= '0' && s[pos] <= '9'))
                ++pos;

            // decode characters to bytes
            int i = 0;    // byte position modulo 7 (0..6 wrapping around)
            int code=0;
            byte byteValue=0;

            while (length-- > 0)
            {
                if (((1<<i)&0x2b)!=0)
                {
                    // fetch another Han character at i=0, 1, 3, 5
                    if(pos >= s.Length)
                    {
                        // Too few Han characters representing binary data.
                        System.Diagnostics.Debug.Assert(pos < s.Length);
                        return null;
                    }

                    code=s[pos++]-0x5000;
                }

                switch (i%7)
                {
                case 0:
                    byteValue = System.Convert.ToByte(code>>6);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&0x3f)<<2);
                    break;

                case 1:
                    byteValue |= System.Convert.ToByte(code>>12);
                    buf.Add(byteValue);
                    break;

                case 2:
                    byteValue = System.Convert.ToByte((code>>4)&0xff);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&0xf)<<4);
                    break;

                case 3:
                    byteValue |= System.Convert.ToByte(code>>10);
                    buf.Add(byteValue);
                    break;

                case 4:
                    byteValue = System.Convert.ToByte((code>>2)&0xff);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&3)<<6);
                    break;

                case 5:
                    byteValue |= System.Convert.ToByte(code>>8);
                    buf.Add(byteValue);
                    break;

                case 6:
                    byteValue = System.Convert.ToByte(code&0xff);
                    buf.Add(byteValue);
                    break;
                }

                // advance to the next byte position
                if(++i==7)
                    i=0;
            }

            return buf.ToArray();
        }
    }
}

namespace Base16kCS
{
    class Program
    {
        static void Main(string[] args)
        {
            var drand = new Random();

            // Create 500 different binary objects, then encode and decode them.
            // The first 16 objects will have length 0,1,2 ... 16 to test boundary conditions.
            for (int loop = 0; loop < 500; ++loop)
            {
                Console.WriteLine("{0}", loop);

                int dw = drand.Next(128000);
                var org = new List<byte>(dw);
                for (int i = 0; i < dw; ++i)
                    org.Add(Convert.ToByte(drand.Next(256)));

                if (loop < 16)
                    org = org.Take(loop).ToList();

                string wstr = JDanielSmith.Convert.ToBase16kString(org.ToArray());

                byte[] bin = JDanielSmith.Convert.FromBase16kString(wstr);

                System.Diagnostics.Debug.Assert(org.SequenceEqual(bin));
            }
        }
    }
}
Jonathan C Dickinson

I fooled around with direct char arrays, and your one failing case works with my implementation. The code is hardly tested, though, so run your own tests first.

You could speed this up by using unsafe code. But I am sure UnicodeEncoding is just as slow (if not slower).

/// <summary>
/// Represents an encoding that packs bytes tightly into a string.
/// </summary>
public class ByteEncoding : Encoding
{
    /// <summary>
    /// Gets the Byte Encoding instance.
    /// </summary>
    public static readonly Encoding Encoding = new ByteEncoding();

    private ByteEncoding()
    {
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            // Work out some indices.
            int j = i * 2;
            int k = byteIndex + j;

            // Get the two bytes that make up this UTF-16 code unit.
            byte[] packedBytes = BitConverter.GetBytes((short) chars[charIndex + i]);

            // Unpack them.
            bytes[k] = packedBytes[0];
            bytes[k + 1] = packedBytes[1];
        }

        return charCount * 2;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i += 2)
        {
            // Work out some indices.
            int j = i / 2;
            int k = byteIndex + i;

            // Make sure we don't read too many bytes.
            byte byteB = 0;
            if (i + 1 < byteCount)
            {
                byteB = bytes[k + 1];
            }

            // Add it to the array.
            chars[charIndex + j] = (char) BitConverter.ToInt16(new byte[] { bytes[k], byteB }, 0);
        }

        return (byteCount / 2) + (byteCount % 2); // Round up.
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count * 2;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return (count / 2) + (count % 2);
    }

    public override int GetMaxByteCount(int charCount)
    {
        return charCount * 2;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return (byteCount / 2) + (byteCount % 2);
    }
}

Here is some test code:

    static void Main(string[] args)
    {
        byte[] original = new byte[256];

        // Note that we can't tell on the decode side how
        // long the array was if the original length is
        // an odd number. This will result in an
        // inconclusive result.
        for (int i = 0; i < original.Length; i++)
            original[i] = (byte) Math.Abs(i - 1);

        string packed = ByteEncoding.Encoding.GetString(original);
        byte[] unpacked = ByteEncoding.Encoding.GetBytes(packed);

        bool pass = true;

        if (original.Length != unpacked.Length)
        {
            Console.WriteLine("Inconclusive: Lengths differ.");
            pass = false;
        }

        int min = Math.Min(original.Length, unpacked.Length);
        for (int i = 0; i < min; i++)
        {
            if (original[i] != unpacked[i])
            {
                Console.WriteLine("Fail: Invalid at a position {0}.", i);
                pass = false;
            }
        }

        Console.WriteLine(pass ? "All Passed" : "Failure Present");

        Console.ReadLine();
    }

The test works, but you are going to have to test it with your API function.

There is another way to work around this limitation, although I am not sure how well it would work.

Firstly, you will need to figure out what type of string the API call is expecting, and what the structure of that string is. Taking a simple example, let's consider the .NET string:

  • Int32 _length;
  • byte[] _data;
  • byte _terminator = 0;

Add an overload to your API call, thus:

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(byte[] data);

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(string comment);

Then when you need to call the byte version you can do the following:

    public static void TheLegacyWisperer(byte[] data)
    {
        byte[] realData = new byte[data.Length + 4 /* _length */ + 1 /* _terminator */ ];
        byte[] lengthBytes = BitConverter.GetBytes(data.Length);
        Array.Copy(lengthBytes, realData, 4);
        Array.Copy(data, 0, realData, 4, data.Length);
        // realData[end] is equal to 0 in any case.
        MyLegacyFunction(realData);
    }