What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?

后端 未结 8 664
自闭症患者
自闭症患者 2020-11-30 22:08

The standard string representation of GUID takes about 36 characters. Which is very nice, but also really wasteful. I am wondering, how to encode it in the shortest possible

8条回答
  •  自闭症患者
    2020-11-30 22:26

    This is an old question, but I had to solve it in order for a system I was working on to be backward compatible.

    The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It never got shown to the user and was not indexed in any way.

    Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique) and if I could encode it losslessly into 20 characters, then it would be a good solution given the constraints.

    Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data. So a 16 byte guid will just fit into 20 Ascii characters using this encoding scheme. A Guid can have 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.

    When writing to the database I had some concerns about truncation so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, it would only ever be passed through a paramaterized command, so any special SQL characters would be escaped automatically.

    I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage you might need to choose a different character set as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part:

    /// 
    /// This code implements an encoding scheme that uses 85 printable ascii characters 
    /// to encode the same volume of information as contained in a Guid.
    /// 
    /// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be 
    /// represented in 20 Ascii bytes. A Guid can have 
    /// 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of 
    /// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.
    /// 
    /// Lower-case characters are not included in this encoding to avoid collation 
    /// issues. 
    /// This is a departure from standard Ascii-85 which does include lower case 
    /// characters.
    /// In addition, no whitespace characters are included as these may be truncated in 
    /// the database depending on the storage mechanism - ie VARCHAR vs CHAR.
    /// 
    internal static class Ascii85
    {
        /// 
        /// 85 printable ascii characters with no lower case ones, so database 
        /// collation can't bite us. No ' ' character either so database can't 
        /// truncate it!
        /// Unfortunately, these limitation mean resorting to some strange 
        /// characters like 'Æ' but we won't ever have to type these, so it's ok.
        /// 
        private static readonly char[] kEncodeMap = new[]
        { 
            '0','1','2','3','4','5','6','7','8','9',  // 10
            'A','B','C','D','E','F','G','H','I','J',  // 20
            'K','L','M','N','O','P','Q','R','S','T',  // 30
            'U','V','W','X','Y','Z','|','}','~','{',  // 40
            '!','"','#','$','%','&','\'','(',')','`', // 50
            '*','+',',','-','.','/','[','\\',']','^', // 60
            ':',';','<','=','>','?','@','_','¼','½',  // 70
            '¾','ß','Ç','Ð','€','«','»','¿','•','Ø',  // 80
            '£','†','‡','§','¥'                       // 85
        };
    
        /// 
        /// A reverse mapping of the  array for decoding 
        /// purposes.
        /// 
        private static readonly IDictionary kDecodeMap;
    
        /// 
        /// Initialises the .
        /// 
        static Ascii85()
        {
            kDecodeMap = new Dictionary();
    
            for (byte i = 0; i < kEncodeMap.Length; i++)
            {
                kDecodeMap.Add(kEncodeMap[i], i);
            }
        }
    
        /// 
        /// Decodes an Ascii-85 encoded Guid.
        /// 
        /// The Guid encoded using Ascii-85.
        /// A Guid decoded from the parameter.
        public static Guid Decode(string ascii85Encoding)
        { 
            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20
            // characters long.
            if(ascii85Encoding.Length != 20)
            {
                throw new ArgumentException(
                    "An encoded Guid should be 20 characters long.", 
                    "ascii85Encoding");
            }
    
            // We only support upper case characters.
            ascii85Encoding = ascii85Encoding.ToUpper();
    
            // Split the string in half and decode each substring separately.
            var higher = ascii85Encoding.Substring(0, 10).AsciiDecode();
            var lower = ascii85Encoding.Substring(10, 10).AsciiDecode();
    
            // Convert the decoded substrings into an array of 16-bytes.
            var byteArray = new[]
            {
                (byte)((higher & 0xFF00000000000000) >> 56),        
                (byte)((higher & 0x00FF000000000000) >> 48),        
                (byte)((higher & 0x0000FF0000000000) >> 40),        
                (byte)((higher & 0x000000FF00000000) >> 32),        
                (byte)((higher & 0x00000000FF000000) >> 24),        
                (byte)((higher & 0x0000000000FF0000) >> 16),        
                (byte)((higher & 0x000000000000FF00) >> 8),         
                (byte)((higher & 0x00000000000000FF)),  
                (byte)((lower  & 0xFF00000000000000) >> 56),        
                (byte)((lower  & 0x00FF000000000000) >> 48),        
                (byte)((lower  & 0x0000FF0000000000) >> 40),        
                (byte)((lower  & 0x000000FF00000000) >> 32),        
                (byte)((lower  & 0x00000000FF000000) >> 24),        
                (byte)((lower  & 0x0000000000FF0000) >> 16),        
                (byte)((lower  & 0x000000000000FF00) >> 8),         
                (byte)((lower  & 0x00000000000000FF)),  
            };
    
            return new Guid(byteArray);
        }
    
        /// 
        /// Encodes binary data into a plaintext Ascii-85 format string.
        /// 
        /// The Guid to encode.
        /// Ascii-85 encoded string
        public static string Encode(Guid guid)
        {
            // Convert the 128-bit Guid into two 64-bit parts.
            var byteArray = guid.ToByteArray();
            var higher = 
                ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) | 
                ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) |
                ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) | 
                ((UInt64)byteArray[6] << 8)  | byteArray[7];
    
            var lower = 
                ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) | 
                ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) |
                ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) | 
                ((UInt64)byteArray[14] << 8)  | byteArray[15];
    
            var encodedStringBuilder = new StringBuilder();
    
            // Encode each part into an ascii-85 encoded string.
            encodedStringBuilder.AsciiEncode(higher);
            encodedStringBuilder.AsciiEncode(lower);
    
            return encodedStringBuilder.ToString();
        }
    
        /// 
        /// Encodes the given integer using Ascii-85.
        /// 
        /// The  to 
        /// append the results to.
        /// The integer to encode.
        private static void AsciiEncode(
            this StringBuilder encodedStringBuilder, UInt64 part)
        {
            // Nb, the most significant digits in our encoded character will 
            // be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;
    
            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be 
            // 10 characters long.
            for (var i = 0; i < 10; i++)
            {
                // Get the remainder when dividing by the base.
                var remainder = part % charCount;
    
                // Divide by the base.
                part /= charCount;
    
                // Add the appropriate character for the current value (0-84).
                encodedStringBuilder.Append(kEncodeMap[remainder]);
            }
        }
    
        /// 
        /// Decodes the given string from Ascii-85 to an integer.
        /// 
        /// Decodes a 10 character Ascii-85 
        /// encoded string.
        /// The integer representation of the parameter.
        private static UInt64 AsciiDecode(this string ascii85EncodedString)
        {
            if (ascii85EncodedString.Length != 10)
            {
                throw new ArgumentException(
                    "An Ascii-85 encoded Uint64 should be 10 characters long.", 
                    "ascii85EncodedString");
            }
    
            // Nb, the most significant digits in our encoded character 
            // will be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;
            UInt64 result = 0;
    
            // Starting with the right-most (most-significant) character, 
            // iterate through the encoded string and decode.
            for (var i = ascii85EncodedString.Length - 1; i >= 0; i--)
            {
                // Multiply the current decoded value by the base.
                result *= charCount;
    
                // Add the integer value for that encoded character.
                result += kDecodeMap[ascii85EncodedString[i]];
            }
    
            return result;
        }
    }
    

    Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism of where Guid.NewGuid() is used, but they should get you started:

    /// 
    /// Tests to verify that the Ascii-85 encoding is functioning as expected.
    /// 
    [TestClass]
    [UsedImplicitly]
    public class Ascii85Tests
    {
        [TestMethod]
        [Description("Ensure that the Ascii-85 encoding is correct.")]
        [UsedImplicitly]
        public void CanEncodeAndDecodeAGuidUsingAscii85()
        {
            var guidStrings = new[]
            {
                "00000000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000000000FF",
                "00000000-0000-0000-0000-00000000FF00",
                "00000000-0000-0000-0000-000000FF0000",
                "00000000-0000-0000-0000-0000FF000000",
                "00000000-0000-0000-0000-00FF00000000",
                "00000000-0000-0000-0000-FF0000000000",
                "00000000-0000-0000-00FF-000000000000",
                "00000000-0000-0000-FF00-000000000000",
                "00000000-0000-00FF-0000-000000000000",
                "00000000-0000-FF00-0000-000000000000",
                "00000000-00FF-0000-0000-000000000000",
                "00000000-FF00-0000-0000-000000000000",
                "000000FF-0000-0000-0000-000000000000",
                "0000FF00-0000-0000-0000-000000000000",
                "00FF0000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-00000000FFFF",
                "00000000-0000-0000-0000-0000FFFF0000",
                "00000000-0000-0000-0000-FFFF00000000",
                "00000000-0000-0000-FFFF-000000000000",
                "00000000-0000-FFFF-0000-000000000000",
                "00000000-FFFF-0000-0000-000000000000",
                "0000FFFF-0000-0000-0000-000000000000",
                "FFFF0000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000FFFFFFFF",
                "00000000-0000-0000-FFFF-FFFF00000000",
                "00000000-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-0000-0000-0000-000000000000",
                "00000000-0000-0000-FFFF-FFFFFFFFFFFF",
                "FFFFFFFF-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",
                "1000000F-100F-100F-100F-10000000000F"
            };
    
            foreach (var guidString in guidStrings)
            {
                var guid = new Guid(guidString);
                var encoded = Ascii85.Encode(guid);
    
                Assert.AreEqual(
                    20, 
                    encoded.Length, 
                    "A guid encoding should not exceed 20 characters.");
    
                var decoded = Ascii85.Decode(encoded);
    
                Assert.AreEqual(
                    guid, 
                    decoded, 
                    "The guids are different after being encoded and decoded.");
            }
        }
    
        [TestMethod]
        [Description(
            "The Ascii-85 encoding is not susceptible to changes in character case.")]
        [UsedImplicitly]
        public void Ascii85IsCaseInsensitive()
        {
            const int kCount = 50;
    
            for (var i = 0; i < kCount; i++)
            {
                var guid = Guid.NewGuid();
    
                // The encoding should be all upper case. A reliance 
                // on mixed case will make the generated string 
                // vulnerable to sql collation.
                var encoded = Ascii85.Encode(guid);
    
                Assert.AreEqual(
                    encoded, 
                    encoded.ToUpper(), 
                    "The Ascii-85 encoding should produce only uppercase characters.");
            }
        }
    }
    

    I hope this saves somebody some trouble.

    Also, if you find any bugs then let me know ;-)

提交回复
热议问题