Efficient serialization of numpy boolean arrays

Asked by 一整个雨季 on 2020-12-10 15:07

I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each array.)

3 Answers
  •  爱一瞬间的悲伤
     2020-12-10 15:21

    I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!

    In [1]: import numpy
    In [2]: a = numpy.arange(25).reshape((5, 5))
    In [3]: (a > 10).tostring()
    Out[3]: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'
    

    Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, each \x.. escape above represents a single byte, so this is as compact as you can get without bitstring-style bit packing.

    In [4]: len((a > 10).tostring())
    Out[4]: 25
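
    Since those bytes are immutable and hashable, they drop straight into a counting dictionary like the one the question describes. Here is a minimal sketch of that usage (my own illustration, not from the original answer); note that on modern NumPy, tostring() is deprecated in favour of the byte-for-byte equivalent tobytes():

    ```python
    import numpy as np

    # Sketch: count how many times each boolean array has been observed,
    # using the array's raw bytes as the dictionary key.
    counts = {}
    observations = [
        np.arange(25).reshape(5, 5) > 10,
        np.arange(25).reshape(5, 5) > 10,   # a repeat of the first array
        np.arange(25).reshape(5, 5) > 20,
    ]
    for arr in observations:
        key = arr.tobytes()                 # immutable bytes, one byte per bool
        counts[key] = counts.get(key, 0) + 1
    ```

    Two of the three observations share a key, so the dictionary ends up with two entries.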
    

    Converting back is easy and fast:

    In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
    Out[5]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
    100000 loops, best of 3: 5.75 us per loop
    

    Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...

    In [7]: import cPickle
    In [8]: len(cPickle.dumps(a > 10))
    Out[8]: 255
    

    It's also slower:

    In [9]: cPickle.loads(cPickle.dumps(a > 10))
    Out[9]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
    10000 loops, best of 3: 45.8 us per loop
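
    If you need the shape but want to avoid pickle's size and speed overhead, one option not covered above is to pair the raw bytes with the shape tuple; a tuple of hashables is itself hashable, so it works directly as a dictionary key. This is my own sketch (from_key is a hypothetical helper name, not part of any library):

    ```python
    import numpy as np

    a = np.arange(25).reshape(5, 5) > 10

    # Bundle the shape with the bytes: only a few dozen bytes larger than
    # tobytes() alone, and far smaller than a pickle.
    key = (a.shape, a.tobytes())

    def from_key(key):
        """Rebuild the boolean array from a (shape, bytes) key."""
        shape, raw = key
        return np.frombuffer(raw, dtype=bool).reshape(shape)

    restored = from_key(key)
    ```

    Note that frombuffer returns a read-only view over the bytes, which is fine for lookups and comparisons.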
    

    My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.

    In [11]: import bitstring
    

    You have to flatten the numpy array explicitly:

    In [12]: b = bitstring.ConstBitArray((a > 10).flat)
    In [13]: b.bin
    Out[13]: '0b0000000000011111111111111'
    

    It's immutable so it hashes well:

    In [14]: hash(b)
    Out[14]: 12144
    

    It's super-easy to convert back into an array, but again, shape information is lost.

    In [15]: numpy.array(b).reshape(5, 5)
    Out[15]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    

    It's also the slowest option by far:

    In [16]: %timeit numpy.array(b).reshape(5, 5)
    1000 loops, best of 3: 240 us per loop
    

    Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:

    In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
    1000 loops, best of 3: 283 us per loop
    
    In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
    10000 loops, best of 3: 19.9 us per loop
    

    Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.

    So here's a full rundown of the results. First, definitions:

    small_nda = numpy.arange(25).reshape(5, 5) > 10
    big_nda = numpy.arange(10000).reshape(100, 100) > 5000
    small_barray = bitarray.bitarray(small_nda.flat)
    big_barray = bitarray.bitarray(big_nda.flat)
    small_bstr = bitstring.ConstBitArray(small_nda.flat)
    big_bstr = bitstring.ConstBitArray(big_nda.flat)
    

    keysize is the result of sys.getsizeof({small|big}_nda.tostring()) for the numpy arrays; sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()) for the bitarrays; and sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) for the bitstrings. The latter two methods return the bits packed into bytes, so they should be good estimates of the space each key takes.

    speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing, which is either a one-time cost if you cache the string or a cost per dict operation if you don't cache it.

             small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr
    
    keysize  64          10040     148            1394         100          1346
    
    speed    2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2 ms
                                   + 161 ns       + 257 ns
    

    As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.
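
    aix's subclass isn't reproduced in this thread, but the idea behind it -- deriving the hash from the content rather than the object identity -- can be sketched without bitarray at all. This hypothetical FrozenBoolArray wrapper is my own illustration of that idea, not aix's code:

    ```python
    import numpy as np

    class FrozenBoolArray:
        """Immutable, hashable wrapper around a boolean array's packed
        content, so equal contents hash equal (unlike a raw bitarray,
        whose hash is identity-based)."""
        __slots__ = ("_raw", "_shape")

        def __init__(self, arr):
            arr = np.asarray(arr, dtype=bool)
            self._raw = arr.tobytes()
            self._shape = arr.shape

        def __hash__(self):
            return hash(self._raw)

        def __eq__(self, other):
            return (isinstance(other, FrozenBoolArray)
                    and self._shape == other._shape
                    and self._raw == other._raw)

        def to_array(self):
            return np.frombuffer(self._raw, dtype=bool).reshape(self._shape)
    ```

    Two wrappers built from equal arrays compare equal and collide in a dict, which is exactly the behaviour you want from a key type.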

    On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get for big arrays remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.
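
    If that 8x key size ever does become the bottleneck, NumPy itself can pack the bits without any third-party library. This wasn't discussed in the thread, but numpy.packbits and numpy.unpackbits store 8 booleans per byte, giving keys about the same size as the bitarray/bitstring ones:

    ```python
    import numpy as np

    big = np.arange(10000).reshape(100, 100) > 5000

    # packbits stores 8 bools per byte: 10000 bits -> 1250 bytes,
    # ~8x smaller than tobytes(), at the cost of a pack/unpack step.
    key = np.packbits(big.ravel()).tobytes()

    n = big.size
    restored = np.unpackbits(
        np.frombuffer(key, dtype=np.uint8), count=n
    ).astype(bool).reshape(big.shape)
    ```

    As with the other compact keys, the shape still has to be tracked separately (or bundled into the key as a tuple).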
