Encoding an integer in 7-bit format of C# BinaryReader.ReadString

C#'s BinaryReader has a function that according to MSDN, reads an integer encoded as "seven bit integer", and then reads a string with the length of this integer.

Is there a clear documentation for the seven bit integer format (I have a rough understanding that the MSB or the LSB marks whether there are more bytes to read, and the rest bits are the data, but I'll be glad for something more exact).

Even better, is there a C implementation for reading and writing numbers in this format?

Well, the documentation for BinaryReader.Read7BitEncodedInt already says, that it expects the value to be written with BinaryWriter.Write7BitEncodedInt and that method documentation details the format:

The integer of the value parameter is written out seven bits at a time, starting with the seven least-significant bits. The high bit of a byte indicates whether there are more bytes to be written after this one.

If value will fit in seven bits, it takes only one byte of space. If value will not fit in seven bits, the high bit is set on the first byte and written out. value is then shifted by seven bits and the next byte is written. This process is repeated until the entire integer has been written.

So the integer 1259551277, in binary 1001011000100110011101000101101 will be converted into that 7-bit format as follows:

Remaining integer                 encoded bytes
1001011000100110011101000101101
100101100010011001110100          00101101
10010110001001100                 10101101 01110100
1001011000                        10101101 11110100 01001100
100                               10101101 11110100 11001100 01011000
0                                 10101101 11110100 11001100 11011000 00000100

I'm not that confident in my C skills right now to provide a working implementation, though. But it's not very hard to do, based on that description.

Basically, the idea behind a 7-bit encoded Int32 is to reduce the number of bytes required for small values. It works like this:

The first 7 least significant bits of the original value are taken.
If this value exceeds what can fit into these 7 bits, the 8th bit is set to 1, indicating another byte has to be read. Otherwise that bit is 0 and reading ends here.
The next byte is read, its value shifted left by 7 bits and ORed to the previously read value to combine them together. Again, the 8th bit of this byte indicates if another byte must be read (shifting the read value further 7 more times).
This continues until a maximum of 5 bytes has been read (because even Int32.MaxValue would not require more than 5 bytes when only 1 bit is stolen from each byte). If the highest bit of the 5th byte is still set, you've read something that isn't a 7-bit encoded Int32.

Note that since it is written byte-by-byte, endianness doesn't matter at all for these values. The following number of bytes are required for a given range of values:

1 byte: 0 to 127
2 bytes: 128 to 16,383
3 bytes: 16,384 to 2,097,151
4 bytes: 2,097,152 to 268,435,455
5 bytes: 268,435,456 to 2,147,483,647 (Int32.MaxValue) and -2,147,483,648 (Int32.MinValue) to -1

As you can see, the implementation is kinda dumb and always requires 5 bytes for negative values as the sign bit is the 32nd bit of the original value, always ending up in the 5th byte.

Thus, I do not recommend it for negative values or values bigger than ~250,000,000. I've only seen it used internally for the string length prefix of .NET strings (those you can read/write with BinaryReader.ReadString and BinaryReader.WriteString), describing the number of characters following of which the string consists, only having positive values.

While you can look up the original .NET source, I use different implementations in my BinaryData library.

I had to explore this 7-bit format also. In one of my projects I pack some data into files using C#'s BinaryWriter and then unpack it again with BinaryReader, which works nicely.

Later I needed to implement a reader for this project's packed files for Java, too. Java has a class named DataInputStream (in java.io package), which has some similar methods. Unfortunately DataInputStream's data interpretation is very different than C#'s.

To solve my problem I ported C#'s BinaryReader to Java myself by writing a class that extends java.io.DataInputStream. Here is the method I wrote, which does exactly the same as C#'s BinaryReader.readString():

public String csReadString() throws IOException {
    int stringLength = 0;
    boolean stringLengthParsed = false;
    int step = 0;
    while(!stringLengthParsed) {
        byte part = csReadByte();
        stringLengthParsed = (((int)part >> 7) == 0);
        int partCutter = part & 127;
        part = (byte)partCutter;
        int toAdd = (int)part << (step*7);
        stringLength += toAdd;
        step++;
    }
    char[] chars = new char[stringLength];
    for(int i = 0; i < stringLength; i++) {
        chars[i] = csReadChar();
    }
    return new String(chars);
}

/*
 * Parameters:  plOutput[out] - The decoded integer
 *              pbyInput[in]  - Buffer containing encoded integer
 * Returns:     Number of bytes used to encode the integer
 */
int SevenBitEncodingToInteger(int *plOutput, char *pbyInput)
{
    int lSize = 0;
    int lTemp = 0;
    while(true)
    {
        lTemp += pbyInput[lSize] & 0x7F;
        if(pbyInput[lSize++] > 127)
            lTemp <<= 7;
        else
            break;
    }
    *plOutput = lTemp;
    return lSize;
}

The format is described here: http://msdn.microsoft.com/en-us/library/system.io.binarywriter.write7bitencodedint.aspx

Write7BitEncodedInt method contains the description: The lowest 7 bits of each byte encode the next 7 bits of the number. The highest bit is set when there's another byte following.

来源：https://stackoverflow.com/questions/1550560/encoding-an-integer-in-7-bit-format-of-c-sharp-binaryreader-readstring

标签

.net

serialization

binary

7-bit