Efficient serialization of numpy boolean arrays

Asked by 一整个雨季 on 2020-12-10 15:07

I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each array.)

3 Answers
  •  爱一瞬间的悲伤
     2020-12-10 15:21

    I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!

    In [1]: import numpy
    In [2]: a = numpy.arange(25).reshape((5, 5))
    In [3]: (a > 10).tostring()
    Out[3]: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'
    

    Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, each \x.. escape above represents a single byte, so this is as compact as you can get without bitstring-style bit packing.

    In [4]: len((a > 10).tostring())
    Out[4]: 25
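
    Since those bytes are immutable and hashable, they drop straight into a counting dictionary like the one the question describes. Here is a minimal sketch of that usage (my own illustration, not from the original answer); note that on modern NumPy, tostring() is deprecated in favour of the byte-for-byte equivalent tobytes():

    ```python
    import numpy as np

    # Sketch: count how many times each boolean array has been observed,
    # using the array's raw bytes as the dictionary key.
    counts = {}
    observations = [
        np.arange(25).reshape(5, 5) > 10,
        np.arange(25).reshape(5, 5) > 10,   # a repeat of the first array
        np.arange(25).reshape(5, 5) > 20,
    ]
    for arr in observations:
        key = arr.tobytes()                 # immutable bytes, one byte per bool
        counts[key] = counts.get(key, 0) + 1
    ```

    Two of the three observations share a key, so the dictionary ends up with two entries.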
    

    Converting back is easy and fast:

    In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
    Out[5]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
    100000 loops, best of 3: 5.75 us per loop
    

    Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...

    In [7]: import cPickle
    In [8]: len(cPickle.dumps(a > 10))
    Out[8]: 255
    

    It's also slower:

    In [9]: cPickle.loads(cPickle.dumps(a > 10))
    Out[9]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
    10000 loops, best of 3: 45.8 us per loop
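
    If you need the shape but want to avoid pickle's size and speed overhead, one option not covered above is to pair the raw bytes with the shape tuple; a tuple of hashables is itself hashable, so it works directly as a dictionary key. This is my own sketch (from_key is a hypothetical helper name, not part of any library):

    ```python
    import numpy as np

    a = np.arange(25).reshape(5, 5) > 10

    # Bundle the shape with the bytes: only a few dozen bytes larger than
    # tobytes() alone, and far smaller than a pickle.
    key = (a.shape, a.tobytes())

    def from_key(key):
        """Rebuild the boolean array from a (shape, bytes) key."""
        shape, raw = key
        return np.frombuffer(raw, dtype=bool).reshape(shape)

    restored = from_key(key)
    ```

    Note that frombuffer returns a read-only view over the bytes, which is fine for lookups and comparisons.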
    

    My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.

    In [11]: import bitstring
    

    You have to flatten the numpy array explicitly:

    In [12]: b = bitstring.ConstBitArray((a > 10).flat)
    In [13]: b.bin
    Out[13]: '0b0000000000011111111111111'
    

    It's immutable so it hashes well:

    In [14]: hash(b)
    Out[14]: 12144
    

    It's super-easy to convert back into an array, but again, shape information is lost.

    In [15]: numpy.array(b).reshape(5, 5)
    Out[15]: 
    array([[False, False, False, False, False],
           [False, False, False, False, False],
           [False,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True],
           [ True,  True,  True,  True,  True]], dtype=bool)
    

    It's also the slowest option by far:

    In [16]: %timeit numpy.array(b).reshape(5, 5)
    1000 loops, best of 3: 240 us per loop
    

    Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:

    In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
    1000 loops, best of 3: 283 us per loop
    
    In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
    10000 loops, best of 3: 19.9 us per loop
    

    Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.

    So here's a full rundown of the results. First, definitions:

    small_nda = numpy.arange(25).reshape(5, 5) > 10
    big_nda = numpy.arange(10000).reshape(100, 100) > 5000
    small_barray = bitarray.bitarray(small_nda.flat)
    big_barray = bitarray.bitarray(big_nda.flat)
    small_bstr = bitstring.ConstBitArray(small_nda.flat)
    big_bstr = bitstring.ConstBitArray(big_nda.flat)
    

    keysize is the result of sys.getsizeof({small|big}_nda.tostring()) for the numpy arrays; sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()) for the bitarrays; and sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) for the bitstrings. The latter two methods return the bits packed into bytes, so they should be good estimates of the space each key takes.

    speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing, which is either a one-time cost if you cache the string or a cost per dict operation if you don't cache it.

             small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr
    
    keysize  64          10040     148            1394         100          1346
    
    speed    2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2 ms
                                   + 161 ns       + 257 ns
    

    As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.
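
    aix's subclass isn't reproduced in this thread, but the idea behind it -- deriving the hash from the content rather than the object identity -- can be sketched without bitarray at all. This hypothetical FrozenBoolArray wrapper is my own illustration of that idea, not aix's code:

    ```python
    import numpy as np

    class FrozenBoolArray:
        """Immutable, hashable wrapper around a boolean array's packed
        content, so equal contents hash equal (unlike a raw bitarray,
        whose hash is identity-based)."""
        __slots__ = ("_raw", "_shape")

        def __init__(self, arr):
            arr = np.asarray(arr, dtype=bool)
            self._raw = arr.tobytes()
            self._shape = arr.shape

        def __hash__(self):
            return hash(self._raw)

        def __eq__(self, other):
            return (isinstance(other, FrozenBoolArray)
                    and self._shape == other._shape
                    and self._raw == other._raw)

        def to_array(self):
            return np.frombuffer(self._raw, dtype=bool).reshape(self._shape)
    ```

    Two wrappers built from equal arrays compare equal and collide in a dict, which is exactly the behaviour you want from a key type.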

    On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get for big arrays remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.
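
    If that 8x key size ever does become the bottleneck, NumPy itself can pack the bits without any third-party library. This wasn't discussed in the thread, but numpy.packbits and numpy.unpackbits store 8 booleans per byte, giving keys about the same size as the bitarray/bitstring ones:

    ```python
    import numpy as np

    big = np.arange(10000).reshape(100, 100) > 5000

    # packbits stores 8 bools per byte: 10000 bits -> 1250 bytes,
    # ~8x smaller than tobytes(), at the cost of a pack/unpack step.
    key = np.packbits(big.ravel()).tobytes()

    n = big.size
    restored = np.unpackbits(
        np.frombuffer(key, dtype=np.uint8), count=n
    ).astype(bool).reshape(big.shape)
    ```

    As with the other compact keys, the shape still has to be tracked separately (or bundled into the key as a tuple).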
