compression algorithm for sorted integers

我寻月下人不归 2020-12-09 11:12

I have a large sequence of random integers sorted from the lowest to the highest. The numbers start from 1 bit and end near 45 bits. In the beginning of the list I have numb

6 Answers
  •  旧巷少年郎
    2020-12-09 12:19

    There's a very simple and fairly effective compression technique which can be used for sorted integers in a known range. Like most compression schemes, it is optimized for serial access, although you can build an index to speed up random access if needed.

    It's a type of delta encoding (i.e. each number is represented by the distance from the previous one), consisting of a vector of codes which are either

    • a single 1-bit, representing a delta of 2^k which is added to the delta in the following code, or

    • a 0-bit followed by a k-bit delta, indicating that the next number is the specified delta from the previous one.

    For example, if k is 4, the sequence:

    00011 1 1 00000 1 00001

    codes three numbers. The first four-bit encoding (3) is the first delta, taken from an initial value of 0, so the first number is 3. The next two solitary 1's accumulate to a delta of 2·2^4, or 32, which is added to the following delta of 0000, for a total of 32. So the second number is 3+32=35. Finally, the last delta is a single 2^4 plus 1, total 17, and the third number is 35+17=52.

    The 1-bit indicates that the next delta should be incremented by 2^k (or, more generally, each delta is incremented by 2^k times the number of immediately preceding 1-bits).

    Another, possibly better, way of thinking of this is that each delta is coded as a variable-length bit sequence: 1^i 0 (0|1)^k, representing a delta of i·2^k + [the k-bit suffix]. But the first presentation aligns better with the optimality proof.

    Since each "1" code represents an increment of 2^k, there cannot be more than m/2^k of them, where m is the largest number in the set to be compressed. The remaining codes all correspond to numbers, and have a total length of n·(k + 1) where n is the size of the set. The optimal value of k is roughly log2(m/n), which in your case would be 7 or 8.

    I did a quick proof of concept of the algorithm, without worrying about optimizations. It's still plenty fast; sorting the random sample takes a lot longer than compressing/decompressing it. I tried it with a few different seeds and vector sizes from 16,400,000 to 31,000,000 with a value range of [0, 4,000,000,000). The bits used per data value ranged from 8.59 (n=31000000) to 9.45 (n=16400000). All of the tests were done with 7-bit suffixes; log2(m/n) varies from 7.01 (n=31000000) to 7.93 (n=16400000). I tried with 6-bit and 8-bit suffixes; except in the case of n=31000000, where the 6-bit suffixes were slightly smaller, the 7-bit suffix was always the best. So I guess that the optimal k is not exactly floor(log2(m/n)), but it's not far off.

    Compression code:

    void Compress(std::ostream& os,
                  const std::vector<unsigned long>& v,
                  unsigned long k = 0) {
      BitOut out(os);               // BitOut: bit-stream writer (not shown)
      out.put(v.size(), 64);
      if (v.size()) {
        unsigned long twok;
        if (k == 0) {
          // No k supplied: pick k near log2(max value / count).
          unsigned long ratio = v.back() / v.size();
          for (twok = 1; twok <= ratio / 2; ++k, twok *= 2) { }
        } else {
          twok = 1UL << k;
        }
        out.put(k, 32);

        unsigned long prev = 0;
        for (unsigned long val : v) {
          // One 1-bit per full 2^k step of the delta, then a 0 and the
          // k-bit remainder.
          while (val - prev >= twok) { out.put(1); prev += twok; }
          out.put(0);
          out.put(val - prev, k);
          prev = val;
        }
      }
      out.flush(1);
    }
    

    Decompression:

    std::vector<unsigned long> Decompress(std::istream& is) {
      BitIn in(is);                 // BitIn: bit-stream reader (not shown)
      std::vector<unsigned long> v;
      unsigned long size = in.get(64);
      if (size) {
        unsigned long k = in.get(32);
        unsigned long twok = 1UL << k;

        v.reserve(size);
        unsigned long prev = 0;
        for (; size; --size) {
          // Accumulate 2^k per leading 1-bit, then add the k-bit suffix.
          while (in.get()) prev += twok;
          prev += in.get(k);
          v.push_back(prev);
        }
      }
      return v;
    }
    

    It can be a bit awkward to use variable-length encodings; an alternative is to store the first bit of each code (1 or 0) in a bit vector, and the k-bit suffixes in a separate vector. This would be particularly convenient if k is 8.

    A variant, which results in slightly longer files but is a bit easier to build indexes for, is to use only the 1-bits as deltas. Then the deltas are always a·2^k for some a, possibly 0, where a is the number of consecutive 1-bits preceding the suffix code. The index then consists of the locations of every Nth 1-bit in the bit vector, and the corresponding index into the suffix vector (i.e. the index of the suffix corresponding to the next 0 in the bit vector).

