From a quick test, a Python dict of int => int (all distinct values) with 30 million items easily eats more than 2GB of memory on my Mac. Since I work only with int-to-int dicts, is there a more memory-efficient alternative?
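For reference, a minimal sketch of the kind of test I mean (measuring peak RSS via the resource module is just one way to observe this; exact numbers vary by platform):
>>> import resource
>>> d = {}
>>> for i in xrange(30000000):
...     d[i] = i + 30000000
...
>>> resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak RSS: bytes on Mac OS X, KB on Linux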
A Judy-array-based solution seems like the option I should look into. I'm still looking for a good implementation usable from Python. Will update later.
Update:
I'm now experimenting with a Judy array wrapper at http://code.google.com/p/py-judy/. There seems to be no documentation there, but I found its methods simply by calling dir(...) on its package and objects, and it works.
In the same experiment it eats ~986MB using judy.JudyIntObjectMap, roughly 1/3 of the standard dict. The wrapper also provides JudyIntSet, which in some scenarios saves much more memory, since unlike JudyIntObjectMap it doesn't need to reference any real Python object as a value.
(As tested further below, the Judy array itself uses only a few MB to tens of MB; most of the ~986MB is actually consumed by the value objects in Python's memory space.)
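A quick back-of-the-envelope check supports this: on a 64-bit CPython 2 build each small int object takes about 24 bytes, so the 30M distinct value objects alone account for most of the footprint (sizes below are from my build and may vary):
>>> import sys
>>> sys.getsizeof(30000000)  # bytes per int object, 64-bit CPython 2
24
>>> 30000000 * 24 / (1024.0 ** 2)  # MB for the value objects alone
686.6455078125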
Here's some code in case it helps:
>>> import judy
>>> dir(judy)
['JudyIntObjectMap', 'JudyIntSet', '__doc__', '__file__', '__name__', '__package__']
>>> a=judy.JudyIntObjectMap()
>>> dir(a)
['__class__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__value_sizeof__', 'by_index', 'clear', 'get', 'iteritems', 'iterkeys', 'itervalues', 'pop']
>>> a[100]=1
>>> a[100]="str"
>>> a["str"]="str"
Traceback (most recent call last):
File "", line 1, in
KeyError: 'non-integer keys not supported'
>>> for i in xrange(30000000):
...     a[i] = i + 30000000  # finally eats ~986MB memory
...
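Judging purely by the names dir(...) reveals, the map appears to mirror the built-in dict API; a hedged sketch of how I'd expect the main methods to behave after the loop above (inferred from the names, not from documentation):
>>> a.get(0)  # dict-style lookup
30000000
>>> len(a)    # number of entries
30000000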
Update:
OK, here are the results for a JudyIntSet holding 30M ints.
>>> a=judy.JudyIntSet()
>>> a.add(1111111111111111111111111)
Traceback (most recent call last):
File "", line 1, in
ValueError: we only support integers in the range [0, 2**64-1]
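The fill loop that produced the numbers below wasn't captured in the session; reconstructed, it was essentially:
>>> for i in xrange(30000000):
...     a.add(i)
...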
In total it uses only 5.7MB to store the 30M sequential ints [0, 30000000), which may be due to the Judy array's automatic compression. The ~709MB peak I saw during this test was because I used range(...) instead of the more appropriate xrange(...) to generate the data.
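The range/xrange difference matters because in Python 2 range(...) materializes the entire list of int objects up front, while xrange(...) yields them lazily. Roughly (sizes from my 64-bit build and may vary; note getsizeof of the list doesn't even count the int objects it points to):
>>> import sys
>>> sys.getsizeof(range(1000000))   # a real list: ~8MB of pointers, plus 1M int objects elsewhere
8000072
>>> sys.getsizeof(xrange(1000000))  # a constant-size lazy object
40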
So the size of the core Judy array holding 30M ints is essentially negligible.
If anyone knows a more complete Judy array wrapper, please let me know, since this one only wraps JudyIntObjectMap and JudyIntSet. For an int-int dict, JudyIntObjectMap still requires a real Python object for every value. If we only ever do counter-add and set operations on the values, it would be better to store them as native ints in C space rather than as Python objects. I hope someone is interested in creating or pointing out such a wrapper :)
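Until such a wrapper exists, one way to approximate "int values in C space" for the dense, sequential-key case is the stdlib array module (only a sketch of the idea, not a Judy-based solution, and it only works when keys form a dense range):
>>> from array import array
>>> vals = array('l', (0 for _ in xrange(30000000)))  # 30M C longs, ~240MB, no per-value PyObject
>>> vals[100] += 1  # counter-add; only a transient Python int is created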