Best Compression algorithm for a sequence of integers

离开以前 2020-11-29 16:41

I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

15 answers
  • 2020-11-29 17:34

    We have written recent research papers that survey the best schemes for this problem. Please see:

    Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software: Practice and Experience 45 (1), 2015. http://arxiv.org/abs/1209.2137

    Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience (to appear) http://arxiv.org/abs/1401.6399

    They include an extensive experimental evaluation.

    You can find a complete implementation of all techniques in C++11 online: https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection

    There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte

    If you prefer Java, please see https://github.com/lemire/JavaFastPFOR

  • 2020-11-29 17:34

    Well, I'm voting for the smarter way. All you have to store is [int: start number][int/byte/whatever: number of consecutive values]. In this case, your example array turns into four integer values. After that, you can compress it however you want :)
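A minimal Java sketch of the (start, count) idea, assuming the input is sorted and contains no duplicates. The class and method names (`RunLengthRanges`, `encode`, `decode`) are made up for illustration and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collapses a sorted array of distinct integers into
// (start, count) pairs, one pair per run of consecutive values.
final class RunLengthRanges {
    static int[] encode(int[] sorted) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int start = sorted[i];
            int count = 1;
            // Extend the run while the next value is exactly one larger.
            while (i + count < sorted.length && sorted[i + count] == start + count) {
                count++;
            }
            out.add(start);
            out.add(count);
            i += count;
        }
        int[] pairs = new int[out.size()];
        for (int k = 0; k < pairs.length; k++) {
            pairs[k] = out.get(k);
        }
        return pairs;
    }

    // Expands (start, count) pairs back into the original values.
    static int[] decode(int[] pairs) {
        int total = 0;
        for (int k = 1; k < pairs.length; k += 2) {
            total += pairs[k];
        }
        int[] values = new int[total];
        int pos = 0;
        for (int k = 0; k < pairs.length; k += 2) {
            for (int j = 0; j < pairs[k + 1]; j++) {
                values[pos++] = pairs[k] + j;
            }
        }
        return values;
    }
}
```

For the ranges in the question (1-100, 110-160) this should produce the four integers 1, 100, 110, 51, which can then be fed to any general-purpose compressor.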

  • 2020-11-29 17:35

    TurboPFor: Fastest Integer Compression

    • for C/C++ including Java Critical Natives/JNI Interface
    • SIMD accelerated integer compression
    • Scalar + Integrated (SIMD) differential/Zigzag encoding/decoding for sorted/unsorted integer lists
    • Full range 8/16/32/64 bits integer lists
    • Direct access
    • Benchmark app
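To make the "differential/Zigzag" terms in the list above concrete, here is a plain scalar Java sketch of delta plus zigzag encoding. It is not TurboPFor's API or implementation; the class name `DeltaZigzag` and its methods are assumptions for illustration only.

```java
// Hypothetical illustration of differential (delta) + zigzag coding.
final class DeltaZigzag {
    // Zigzag interleaves signed values so small magnitudes stay small:
    // 0, -1, 1, -2, 2 map to 0, 1, 2, 3, 4.
    static int zigzag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    static int unzigzag(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Replaces each value with the zigzagged difference from its predecessor
    // (the first value is differenced against zero).
    static int[] encode(int[] values) {
        int[] out = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = zigzag(values[i] - prev);
            prev = values[i];
        }
        return out;
    }

    // Reverses encode() by accumulating the decoded differences.
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        int prev = 0;
        for (int i = 0; i < encoded.length; i++) {
            prev += unzigzag(encoded[i]);
            out[i] = prev;
        }
        return out;
    }
}
```

For a sorted list the differences are never negative, so the zigzag step only matters for unsorted input. Either way, the output is mostly small numbers, which is what a bit-packing or variable-length coder wants.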
  • 2020-11-29 17:37

    While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression ratios (compressed size relative to the uncompressed size, so lower is better) for a sequence of one million consecutive integers:

    Algorithm   Ratio
    None        1.0
    Deflate     0.50
    Filtered    0.34
    BZip2       0.11
    Lzma        0.06
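For anyone who wants to run a similar measurement, here is a rough sketch using `java.util.zip.Deflater` from the JDK. It assumes the integers are serialized as 4-byte big-endian values, and it assumes "Filtered" above means `Deflater.FILTERED`; the answer does not say how its data was laid out, so the numbers may not match the table. The class name `DeflateRatio` is made up.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Hypothetical measurement helper built on the JDK's Deflater.
final class DeflateRatio {
    // Serializes each int as 4 big-endian bytes (an assumption about layout).
    static byte[] toBytes(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Compresses the input and returns only the compressed size in bytes.
    static int deflatedSize(byte[] input, int strategy) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(strategy);
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    // Compressed size divided by raw size; lower is better.
    static double ratio(int[] values, int strategy) {
        byte[] raw = toBytes(values);
        return (double) deflatedSize(raw, strategy) / raw.length;
    }
}
```

Calling `DeflateRatio.ratio(values, Deflater.DEFAULT_STRATEGY)` and `DeflateRatio.ratio(values, Deflater.FILTERED)` on the same array gives the two figures to compare.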
    
  • 2020-11-29 17:38

    Your case is very similar to the compression of indices in search engines. The popular compression algorithms used there are PForDelta and Simple16. You can use the kamikaze library for your compression needs.

  • 2020-11-29 17:39

    What size are the numbers? In addition to the other answers, you could consider base-128 variable-length encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - this is described here.

    Combine this with the other techniques so you are storing "skip size", "take size", "skip size", "take size" - but noting that neither "skip" nor "take" will ever be zero, so we'll subtract one from each (which lets you save an extra byte for a handful of values).

    So:

    1-100, 110-160
    

    is "skip 1" (assume start at zero as it makes things easier), "take 100", "skip 9", "take 51"; subtract 1 from each, giving (as decimals)

    0,99,8,50
    

    which encodes as (hex):

    00 63 08 32
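
The skip/take step on its own can be sketched in Java as follows. This is only an illustration, assuming a sorted array of distinct positive integers; the class name `SkipTake` is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a sorted array of distinct positive integers
// into the sequence skip-1, take-1, skip-1, take-1, ...
final class SkipTake {
    static int[] encode(int[] sorted) {
        List<Integer> out = new ArrayList<>();
        int next = 0; // value that would come next if nothing were skipped
        int i = 0;
        while (i < sorted.length) {
            int skip = sorted[i] - next; // values missing before this run
            int take = 1;                // length of the consecutive run
            while (i + take < sorted.length && sorted[i + take] == sorted[i] + take) {
                take++;
            }
            // Neither skip nor take can be zero here, so store them minus one.
            out.add(skip - 1);
            out.add(take - 1);
            next = sorted[i] + take;
            i += take;
        }
        int[] result = new int[out.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = out.get(k);
        }
        return result;
    }
}
```

For 1-100, 110-160 this should give 0, 99, 8, 50, each of which fits in a single byte.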
    

    If we wanted to skip/take a larger number, 300 for example, we subtract 1, giving 299. That no longer fits in 7 bits, so starting with the little end, we encode blocks of 7 bits, with the MSB of each byte indicating continuation:

    299 = 100101011 = (in blocks of 7): 0000010 0101011
    

    so starting with the little end:

    1 0101011 (leading one since continuation)
    0 0000010 (leading zero as no more)
    

    giving:

    AB 02
    

    So we can encode large numbers easily, but small numbers (which sound typical for skip/take) take less space.
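
A small Java sketch of the base-128 scheme, for readers who want it outside the C# listing. The class name `Varint` is made up, and the input is treated as an unsigned 32-bit value.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical base-128 variable-length coder for a single value.
final class Varint {
    // Emits 7 bits per byte, least significant group first; the high bit
    // of a byte is set when more bytes follow.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads one value from the start of the array, stopping at the first
    // byte whose high bit is clear.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }
}
```

Values below 128 take one byte and values below 16384 take two, so short skips and takes stay cheap.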

    You could try running this through "deflate", but it might not help much more...


    If you don't want to deal with all that messy encoding cruft yourself... if you can create the integer-array of the values (0,99,8,50) - you could use protocol buffers with a packed repeated uint32/uint64 - and it'll do all the work for you ;-p

    I don't "do" Java, but here's a full C# implementation (borrowing some of the encoding bits from my protobuf-net project):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    static class Program
    {
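        static void Main()
        {
            // Build the example from the text: 1-100 followed by 110-160.
            var data = new List<int>();
            data.AddRange(Enumerable.Range(1, 100));
            data.AddRange(Enumerable.Range(110, 51));
            int[] arr = data.ToArray(), arr2;
    
            using (MemoryStream ms = new MemoryStream())
            {
                Encode(ms, arr);
                ShowRaw(ms.GetBuffer(), (int)ms.Length); // prints 00630832
                ms.Position = 0; // rewind to read it...
                arr2 = Decode(ms);
            }
            // Confirm the round trip reproduced the input.
            Console.WriteLine(arr.SequenceEqual(arr2));
        }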
        static void ShowRaw(byte[] buffer, int len)
        {
            for (int i = 0; i < len; i++)
            {
                Console.Write(buffer[i].ToString("X2"));
            }
            Console.WriteLine();
        }
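        static int[] Decode(Stream stream)
        {
            var list = new List<int>();
            uint skip, take;
            int last = 0; // next value to emit if nothing is skipped
            // The stream is a sequence of (skip - 1, take - 1) varint pairs.
            while (TryDecodeUInt32(stream, out skip)
                && TryDecodeUInt32(stream, out take))
            {
                last += (int)skip+1; // undo the "subtract one" on skip
                // "<=" undoes the "subtract one" on take
                for(uint i = 0 ; i <= take ; i++) {
                    list.Add(last++);
                }
            }
            return list.ToArray();
        }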
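        // Assumes data is sorted, distinct and positive, as in the question;
        // otherwise gap below would go negative. Returns the bytes written.
        static int Encode(Stream stream, int[] data)
        {
            if (data.Length == 0) return 0;
            byte[] buffer = new byte[10];
            int last = -1, len = 0; // last value written; -1 so counting starts at zero
            for (int i = 0; i < data.Length; )
            {
                // gap = (values skipped since last) - 1
                int gap = data[i] - 2 - last, size = 0;
                // size = (length of the consecutive run) - 1
                while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
                last = data[i - 1];
                len += EncodeUInt32((uint)gap, buffer, stream)
                    + EncodeUInt32((uint)size, buffer, stream);
            }
            return len;
        }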
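        // Writes value as a base-128 varint (7 bits per byte, low bits first)
        // and returns the number of bytes written.
        public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
        {
            int count = 0, index = 0;
            do
            {
                // Set the continuation bit on every byte for now...
                buffer[index++] = (byte)((value & 0x7F) | 0x80);
                value >>= 7;
                count++;
            } while (value != 0);
            buffer[index - 1] &= 0x7F; // ...then clear it on the final byte
            stream.Write(buffer, 0, count);
            return count;
        }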
        public static bool TryDecodeUInt32(Stream source, out uint value)
        {
            int b = source.ReadByte();
            if (b < 0)
            {
                value = 0;
                return false;
            }
    
            if ((b & 0x80) == 0)
            {
                // single-byte
                value = (uint)b;
                return true;
            }
    
            int shift = 7;
    
            value = (uint)(b & 0x7F);
            bool keepGoing;
            int i = 0;
            do
            {
                b = source.ReadByte();
                if (b < 0) throw new EndOfStreamException();
                i++;
                keepGoing = (b & 0x80) != 0;
                value |= ((uint)(b & 0x7F)) << shift;
                shift += 7;
            } while (keepGoing && i < 4);
            if (keepGoing && i == 4)
            {
                throw new OverflowException();
            }
            return true;
        }
    }
    