compression algorithm for sorted integers

我寻月下人不归 2020-12-09 11:12

I have a large sequence of random integers sorted from the lowest to the highest. The numbers start from 1 bit and end near 45 bits. In the beginning of the list I have numb

6 Answers
  •  旧巷少年郎
    2020-12-09 12:19

    There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.

    It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either

    • a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or

    • a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.

    For example, if k is 4, the sequence:

    00011 1 1 00000 1 00001

    codes three numbers. The first four-bit encoding (3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, total 17, and the third number is 35+17=52.

    The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits).

    Another, possibly better, way of thinking of this is that each delta is coded as a variable-length bit sequence: 1^i 0 (0|1)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the optimality proof.

    Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1) where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.

    I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000, where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)), but it's not far off.

    Compression code:

    void Compress(std::ostream& os,
                  const std::vector<unsigned long>& v,
                  unsigned long k = 0) {
      BitOut out(os);               // BitOut: bit-stream writer (not shown)
      out.put(v.size(), 64);
      if (v.size()) {
        unsigned long twok;
        if (k == 0) {
          // No k supplied: pick k near log2(max value / count).
          unsigned long ratio = v.back() / v.size();
          for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
        } else {
          twok = 1UL << k;
        }
        out.put(k, 32);

        unsigned long prev = 0;
        for (unsigned long val : v) {
          // One 1-bit per full 2^k step of the delta, then a 0 and the
          // k-bit remainder.
          while (val - prev >= twok) { out.put(1); prev += twok; }
          out.put(0);
          out.put(val - prev, k);
          prev = val;
        }
      }
      out.flush(1);
    }
    

    Decompression:

    std::vector<unsigned long> Decompress(std::istream& is) {
      BitIn in(is);                 // BitIn: bit-stream reader (not shown)
      std::vector<unsigned long> v;
      unsigned long size = in.get(64);
      if (size) {
        unsigned long k = in.get(32);
        unsigned long twok = 1UL << k;

        v.reserve(size);
        unsigned long prev = 0;
        for (; size; --size) {
          // Accumulate 2^k per leading 1-bit, then add the k-bit suffix.
          while (in.get()) prev += twok;
          prev += in.get(k);
          v.push_back(prev);
        }
      }
      return v;
    }
    

    It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.

    A variant, which results in slightly longer files but is a bit easier to build indexes for, is to use only the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1-bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding to the next 0 in the bit vector).

