Compress numpy arrays efficiently

名媛妹妹 · asked 2020-12-03 13:30

I have tried various methods of compressing some numpy arrays when saving them to disk.

These 1D arrays contain data sampled at a certain sampling rate (c

6 Answers
  •  野趣味 (OP)
     2020-12-03 14:18

    1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24-bit samples with an effective number of bits (ENOB) of 16, the remaining 24 − 16 = 8 bits of noise limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent that it is non-uniform; look at the effective entropy of the noise to determine how compressible it is.
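    To see point 1 concretely, here is a small sketch (not from the original answer) that runs zlib over pure uniform noise and over a noiseless, perfectly periodic signal of the same size:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# pure uniform noise: essentially incompressible
noise = rng.integers(0, 256, size=n, dtype=np.uint8).tobytes()

# noiseless, perfectly periodic signal of the same size: highly compressible
signal = (127 * (1 + np.sin(2 * np.pi * np.arange(n) / 100))).astype(np.uint8).tobytes()

noise_ratio = len(noise) / len(zlib.compress(noise, 9))
signal_ratio = len(signal) / len(zlib.compress(signal, 9))
print(f"noise:  {noise_ratio:.2f}:1")   # ~1:1 (slightly below 1 is possible: format overhead)
print(f"signal: {signal_ratio:.2f}:1")  # orders of magnitude better
```

    The same logic explains the 3:1 bound above: the 8 noise bits in each 24-bit sample behave like the `noise` buffer and pass through the compressor nearly untouched.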

    2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is band-limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
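    The FFT trick in point 2 can be sketched as follows. The 200 MHz sampling rate and 10 MHz bandwidth are the assumed values from the example above, and the test tones are chosen to fall exactly on FFT bins so the reconstruction is exact:

```python
import numpy as np

fs = 200e6  # sampling rate: the assumed 200 MHz from the example
bw = 10e6   # known signal bandwidth: the assumed 10 MHz
N = 1 << 16

# two test tones inside the 10 MHz band, chosen to land exactly on FFT bins
f1 = 983 * fs / N    # ~3 MHz
f2 = 2294 * fs / N   # ~7 MHz
t = np.arange(N) / fs
x = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

# real FFT gives N//2 + 1 complex coefficients covering 0 .. fs/2
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(N, d=1 / fs)

# keep only coefficients below the known bandwidth: 10 MHz of 100 MHz -> ~10:1
keep = freqs <= bw
X_low = X[keep]
print(len(X_low) / len(X))  # fraction of coefficients stored, ~0.1

# reconstruction: zero-pad the kept coefficients and inverse FFT
X_full = np.zeros_like(X)
X_full[keep] = X_low
x_rec = np.fft.irfft(X_full, n=N)
print(np.max(np.abs(x - x_rec)))  # only float rounding error for a band-limited input
```

    For real data the input is never perfectly band-limited, so zeroing the high bins is mildly lossy; the point is that the loss is confined to frequencies you know contain no signal.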

    3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoising could be the same as bandwidth limiting, or a nonlinear filter like a running median. Bandwidth limiting can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
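    A minimal sketch of the last two stages of that pipeline (not from the original answer): a slowly varying chirp stands in for already-denoised, band-limited data, and zlib stands in for gzip since both use DEFLATE. Delta coding shrinks the samples into a narrow range of small values, which the general-purpose compressor then handles far better than the raw stream:

```python
import zlib
import numpy as np

# a slowly varying chirp as a stand-in for denoised, band-limited samples
n = 1_000_000
t = np.arange(n, dtype=np.float64)
x = (1000 * np.sin(2 * np.pi * t * t / 1e9)).astype(np.int32)

# delta compress: y[n] = x[n] - x[n-1], with y[0] = x[0] so the transform is invertible
delta = np.diff(x, prepend=x.dtype.type(0))

raw = len(zlib.compress(x.tobytes(), 9))
dlt = len(zlib.compress(delta.tobytes(), 9))
print(raw, dlt)  # the delta-coded stream compresses substantially better

# lossless inversion: a cumulative sum restores the original exactly
assert np.array_equal(np.cumsum(delta, dtype=np.int32), x)
```

    Because the delta step is exactly invertible, this part of the pipeline is lossless; only the denoise/bandwidth-limit stages discard information.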

    EDIT: An illustration:

    import os.path
    import subprocess
    
    import numpy as np
    
    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB = 16)
    N = 1000000
    data = (np.sin(2 * np.pi * np.linspace(0, N, N) / 100) * (1 << 23)
            + np.random.randn(N) * (1 << 7)).astype(np.int32)
    
    np.save('data.npy', data)
    print(os.path.getsize('data.npy'))
    # 4000080 uncompressed size
    
    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))
    # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise
    
    # quantize away the 8 noise bits (lossy, but only noise is lost)
    data_quantized = data // (1 << 8)
    np.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz'))
    # 318380
    # still 16 bits of signal, but only 2.55 bits per sample to store it
    
