Best Compression algorithm for a sequence of integers

离开以前 2020-11-29 16:41

I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

15 Answers
  • 2020-11-29 17:23

    In addition to the other solutions:

    You could find "dense" areas and use a bitmap to store them.

    So for example:

    If you have 1000 numbers in 400 ranges between 1000-3000, you could use a single bit to denote the existence of each number and two ints to denote the range. Total storage for this range is 2000 bits + 2 ints, so you can store that info in 254 bytes (250 bytes of bitmap plus two 2-byte shorts), which is pretty awesome since even short integers will take up two bytes each, so for this example you get about 7X savings.

    The denser the areas the better this algorithm will do, but at some point just storing start and finish will be cheaper.
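
    As a rough sketch of that bitmap idea (in Python, with illustrative names that are not from the original answer): each dense area is stored as its two endpoints plus one presence bit per position.

    def encode_dense_range(values, start, end):
        # Store (start, end) plus one bit per position in [start, end],
        # set iff that number occurs in `values`.
        present = set(values)
        bits = bytearray((end - start + 1 + 7) // 8)
        for n in range(start, end + 1):
            if n in present:
                i = n - start
                bits[i // 8] |= 1 << (i % 8)
        return start, end, bytes(bits)

    def decode_dense_range(start, end, bits):
        return [start + i for i in range(end - start + 1)
                if bits[i // 8] & (1 << (i % 8))]

    # 1000 numbers spread over 1000..3000: roughly 2000 bits (~250 bytes) of bitmap
    # plus the two endpoints, versus 2000 bytes if each number took a 2-byte short.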

  • 2020-11-29 17:26

    The basic idea you should probably use is, for each range of consecutive integers (I will call these ranges), to store the starting number and the size of the range. For example, if you have a list of 1000 integers, but there are only 10 separate ranges, you can store a mere 20 integers (1 start number and 1 size for each range) to represent this data which would be a compression rate of 98%. Fortunately, there are some more optimizations you can make which will help with cases where the number of ranges is larger.

    1. Store the offset of the starting number relative to the previous starting number, rather than the starting number itself. The advantage here is that the numbers you store will generally require fewer bits (which comes in handy for the next optimization). Additionally, if you only stored the starting numbers, these numbers would all be unique, while storing offsets gives a chance that the numbers are small, close together, or even repeated, which may allow for even further compression with another method applied afterwards.

    2. Use the minimum number of bits possible for both types of integers. You can iterate over the numbers to obtain the largest offset of a starting integer as well as the size of the largest range. You can then use a datatype that most efficiently stores these integers and simply specify the datatype or number of bits at the start of the compressed data. For example, if the largest offset of a starting integer is only 12,000, and the largest range is 9,000 long, then you can use a 2 byte unsigned integer for all of these. You could then cram the pair 2,2 at the start of the compressed data to show that 2 bytes is used for both integers. Of course you can fit this information into a single byte using a little bit of bit manipulation. If you are comfortable with doing a lot of heavy bit manipulation you could store each number as the minimum possible amount of bits rather than conforming to 1, 2, 4, or 8 byte representations.

    With those two optimizations, let's look at a few examples (the raw data in each case is 1,000 four-byte integers, i.e. 4,000 bytes); a short sketch of the encoding follows the examples:

    1. 1,000 integers, biggest offset is 500, 10 ranges
    2. 1,000 integers, biggest offset is 100, 50 ranges
    3. 1,000 integers, biggest offset is 50, 100 ranges

    WITHOUT OPTIMIZATIONS

    1. 20 integers, 4 bytes each = 80 bytes. COMPRESSION = 98%
    2. 100 integers, 4 bytes each = 400 bytes. COMPRESSION = 90%
    3. 200 integers, 4 bytes each = 800 bytes. COMPRESSION = 80%

    WITH OPTIMIZATIONS

    1. 1 byte header + 10 offsets at 2 bytes each (an offset of 500 no longer fits in one byte) + 10 range sizes at 1 byte each = 31 bytes. COMPRESSION = 99.225%
    2. 1 byte header + 100 numbers, 1 byte each = 101 bytes. COMPRESSION = 97.475%
    3. 1 byte header + 200 numbers, 1 byte each = 201 bytes. COMPRESSION = 94.975%
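
    A minimal sketch of the encoding in Python (without the bit-width packing), assuming the input list is sorted; the function names are illustrative only:

    def to_ranges(sorted_values):
        # Collapse a sorted list of integers into (start, length) runs.
        ranges = []
        run_start = prev = sorted_values[0]
        for v in sorted_values[1:]:
            if v == prev + 1:
                prev = v
            else:
                ranges.append((run_start, prev - run_start + 1))
                run_start = prev = v
        ranges.append((run_start, prev - run_start + 1))
        return ranges

    def encode(sorted_values):
        # Optimization 1: store each start as an offset from the previous start.
        # A real encoder would also pick the minimum byte width for the offsets
        # and lengths and emit the header byte described above (optimization 2).
        out, prev_start = [], 0
        for start, length in to_ranges(sorted_values):
            out.append((start - prev_start, length))
            prev_start = start
        return out

    data = list(range(1, 101)) + list(range(110, 161))
    print(encode(data))   # [(1, 100), (109, 51)]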
  • 2020-11-29 17:28

    I'd suggest taking a look at Huffman coding, which can be seen as a restricted form of arithmetic coding (every code word is a whole number of bits). In both cases you analyse your starting sequence to determine the relative frequencies of the different values; more-frequently-occurring values are encoded with fewer bits than the less-frequently-occurring ones.
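
    As a rough illustration (not any particular library's coder), here is a small Python sketch that builds a Huffman code table with heapq; it works well on the delta stream suggested in the other answers, where the value 1 dominates:

    import heapq
    from collections import Counter

    def huffman_code(values):
        # Build a prefix-code table {value: bit string} from symbol frequencies.
        # Minimal sketch: a real coder also stores the table (or the frequencies)
        # alongside the encoded bits so the decoder can rebuild it.
        freq = Counter(values)
        heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
        if len(heap) == 1:                      # degenerate case: one distinct value
            return {v: "0" for v in heap[0][2]}
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)     # two least frequent subtrees
            f2, _, t2 = heapq.heappop(heap)
            # Prepend a branch bit to every code in each subtree and merge.
            merged = {v: "0" + c for v, c in t1.items()}
            merged.update({v: "1" + c for v, c in t2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    # Deltas of a mostly-continuous list are dominated by 1s, so 1 gets a very short code.
    deltas = [1] * 150 + [10] + [2, 2, 3]
    table = huffman_code(deltas)
    encoded = "".join(table[d] for d in deltas)
    print(table, len(encoded), "bits")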

  • 2020-11-29 17:30

    I couldn't get my compression ratio much better than about 0.11 for this. I generated my test data via the Python interpreter; it's a newline-delimited list of the integers 1-100 and 110-160. I use the actual program as a compressed representation of the data. My compressed file is as follows:

    main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]
    

    It's just a Haskell script that generates the file; you can run it via:

    $ runhaskell generator.hs >> data
    

    The size of the generator.hs file is 54 bytes, and the Python-generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or you could compress the compressed file (i.e. the Haskell file).

    One other approach might be to store just 4 bytes of data: the min and max of each sequence, and give those to a generating function. However, loading a file adds more characters to the decompressor, adding more complexity and more bytes. Again, I represented this very specific sequence with a program; it doesn't generalize, it's a compression that's specific to this data. Furthermore, adding generality makes the decompressor larger.

    Another concern is that one must have the Haskell interpreter to run this. When I compiled the program it became much larger. I don't really know why. There is the same problem with Python, so maybe the best approach is to give the ranges, so that some program could decompress the file.

  • 2020-11-29 17:32

    First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). This should in your case give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.

    This is what the PNG format does to improve its compression (it applies one of several difference filters, followed by the same compression algorithm used by gzip).
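
    A minimal sketch of that difference (delta) preprocessing in Python; the decoder is just a cumulative sum:

    def delta_encode(values):
        # Replace each value with its difference from the previous one
        # (the first value is taken relative to an implicit 0).
        prev, out = 0, []
        for v in values:
            out.append(v - prev)
            prev = v
        return out

    def delta_decode(deltas):
        total, out = 0, []
        for d in deltas:
            total += d
            out.append(total)
        return out

    data = list(range(1, 101)) + list(range(110, 161))
    deltas = delta_encode(data)          # mostly 1s, with a single 10 at the gap
    assert delta_decode(deltas) == data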

  • 2020-11-29 17:33

    If you have runs of repeated values, RLE is the easiest to implement and could give you a good result. Nonetheless, other more advanced algorithms that take the entropy into account, such as LZW (which is now patent-free), can usually achieve much better compression.
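
    As a rough illustration, a tiny run-length encoder sketch in Python (names are illustrative); applied to the delta stream from the earlier answer, the long runs of 1 collapse into single pairs:

    def rle_encode(values):
        # Collapse consecutive repeats into (value, count) pairs.
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return [tuple(r) for r in runs]

    deltas = [1] * 100 + [10] + [1] * 50
    print(rle_encode(deltas))   # [(1, 100), (10, 1), (1, 50)]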

    You can take a look at these and other lossless algorithms here.
