Compress numpy arrays efficiently

名媛妹妹 · asked 2020-12-03 13:30

I have tried various methods of compressing some numpy arrays when saving them to disk.

These 1D arrays contain data sampled at a certain sampling rate (c

6 Answers
  •  野趣味 (OP)
     2020-12-03 14:18

    1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24-bit samples with an effective number of bits (ENOB) of 16, the remaining 24 − 16 = 8 bits of noise limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent that it is non-uniform; look at the effective entropy of the noise to determine how compressible it is.
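    To see point 1 concretely, here is a small sketch (not from the original answer) that runs zlib over pure uniform noise and over a noiseless, perfectly periodic signal of the same size:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# pure uniform noise: essentially incompressible
noise = rng.integers(0, 256, size=n, dtype=np.uint8).tobytes()

# noiseless, perfectly periodic signal of the same size: highly compressible
signal = (127 * (1 + np.sin(2 * np.pi * np.arange(n) / 100))).astype(np.uint8).tobytes()

noise_ratio = len(noise) / len(zlib.compress(noise, 9))
signal_ratio = len(signal) / len(zlib.compress(signal, 9))
print(f"noise:  {noise_ratio:.2f}:1")   # ~1:1 (slightly below 1 is possible: format overhead)
print(f"signal: {signal_ratio:.2f}:1")  # orders of magnitude better
```

    The same logic explains the 3:1 bound above: the 8 noise bits in each 24-bit sample behave like the `noise` buffer and pass through the compressor nearly untouched.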

    2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is band-limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
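    The FFT trick in point 2 can be sketched as follows. The 200 MHz sampling rate and 10 MHz bandwidth are the assumed values from the example above, and the test tones are chosen to fall exactly on FFT bins so the reconstruction is exact:

```python
import numpy as np

fs = 200e6  # sampling rate: the assumed 200 MHz from the example
bw = 10e6   # known signal bandwidth: the assumed 10 MHz
N = 1 << 16

# two test tones inside the 10 MHz band, chosen to land exactly on FFT bins
f1 = 983 * fs / N    # ~3 MHz
f2 = 2294 * fs / N   # ~7 MHz
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# real FFT gives N//2 + 1 complex coefficients covering 0 .. fs/2
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(N, d=1 / fs)

# keep only coefficients below the known bandwidth: 10 MHz of 100 MHz -> ~10:1
keep = freqs <= bw
X_low = X[keep]
print(len(X_low) / len(X))  # fraction of coefficients stored, ~0.1

# reconstruction: zero-pad the kept coefficients and inverse FFT
X_full = np.zeros_like(X)
X_full[keep] = X_low
x_rec = np.fft.irfft(X_full, n=N)
print(np.max(np.abs(x - x_rec)))  # only float rounding error for a band-limited input
```

    For real data the input is never perfectly band-limited, so zeroing the high bins is mildly lossy; the point is that the loss is confined to frequencies you know contain no signal.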

    3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoising could be the same as bandwidth limiting, or a nonlinear filter like a running median. Bandwidth limiting can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
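    A minimal sketch of the last two stages of that pipeline (not from the original answer): a slowly varying chirp stands in for already-denoised, band-limited data, and zlib stands in for gzip since both use DEFLATE. Delta coding shrinks the samples into a narrow range of small values, which the general-purpose compressor then handles far better than the raw stream:

```python
import zlib
import numpy as np

# a slowly varying chirp as a stand-in for denoised, band-limited samples
n = 1_000_000
t = np.arange(n, dtype=np.float64)
x = (1000 * np.sin(2 * np.pi * t * t / 1e9)).astype(np.int32)

# delta compress: y[n] = x[n] - x[n-1], with y[0] = x[0] so the transform is invertible
delta = np.diff(x, prepend=x.dtype.type(0))

raw = len(zlib.compress(x.tobytes(), 9))
dlt = len(zlib.compress(delta.tobytes(), 9))
print(raw, dlt)  # the delta-coded stream compresses substantially better

# lossless inversion: a cumulative sum restores the original exactly
assert np.array_equal(np.cumsum(delta, dtype=np.int32), x)
```

    Because the delta step is exactly invertible, this part of the pipeline is lossless; only the denoise/bandwidth-limit stages discard information.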

    EDIT: An illustration:

    import os.path
    import subprocess
    
    import numpy as np
    
    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB = 16)
    N = 1000000
    data = (np.sin(2 * np.pi * np.linspace(0, N, N) / 100) * (1 << 23)
            + np.random.randn(N) * (1 << 7)).astype(np.int32)
    
    np.save('data.npy', data)
    print(os.path.getsize('data.npy'))
    # 4000080 uncompressed size
    
    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))
    # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise
    
    # quantize away the 8 noise bits (lossy, but only noise is lost)
    data_quantized = data // (1 << 8)
    np.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz'))
    # 318380
    # still 16 bits of signal, but only 2.55 bits per sample to store it
    
