2D array to represent a huge python dict, COOrdinate like solution to save memory

Submitted by 家住魔仙堡 on 2019-12-23 12:50:30

Question


I am trying to update dict_with_tuples_key with the data from an array:

import numpy as np

myarray = np.array([[0, 0],  # 0, 1
                    [0, 1],
                    [1, 1],  # 1, 2
                    [1, 2],  # 1, 3
                    [2, 2],
                    [1, 3]])  # many rows like these, shape ~ (10e6, 2)

dict_with_tuples_key = {(0, 1): 1,
                        (3, 7): 1} # ~10e6 keys 

Using an array to store the dict values (thanks to @MSeifert), we get this:

def convert_dict_to_darray(dict_with_tuples_key, myarray):
    idx_max_array = np.max(myarray, axis=0)
    idx_max_dict  = np.max(list(dict_with_tuples_key.keys()), axis=0)  # list() so this also works on Python 3
    lens = np.max([list(idx_max_array), list(idx_max_dict)], axis=0)
    xlen, ylen = lens[0] + 1, lens[1] + 1
    darray = np.zeros((xlen, ylen)) # Empty array to hold all indexes in myarray
    for key, value in dict_with_tuples_key.items():
        darray[key] = value
    return darray

from numba import njit

@njit
def update_darray(darray, myarray):
    # Add 1 to the cell addressed by each (row, col) pair in myarray
    elements = myarray.shape[0]
    for i in range(elements):
        darray[myarray[i, 0], myarray[i, 1]] += 1
    return darray

def darray_to_dict(darray):
    updated_dict = {}
    keys = zip(*map(list, np.nonzero(darray)))
    for x, y in keys:
        updated_dict[(x, y)] = darray[x, y]
    return updated_dict

darray = convert_dict_to_darray(dict_with_tuples_key, myarray)
darray = update_darray(darray, myarray)

I get the exact result needed:

# print(darray_to_dict(darray))
# {(0, 1): 2.0,
#  (0, 0): 1.0,
#  (1, 1): 1.0,
#  (2, 2): 1.0,
#  (1, 2): 1.0,
#  (1, 3): 1.0,
#  (3, 7): 1.0, }

For a small matrix it works quite well, and because @njit compiles the update loop it is very fast, but...

creating the huge empty darray = np.zeros((xlen, ylen)) does not fit in memory. How can we avoid allocating a very sparse array and store only the non-null values, like a sparse matrix in COOrdinate format?
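To make the memory problem concrete, here is a rough back-of-the-envelope sketch. The sizes are assumptions chosen for illustration (the question does not state the largest index), and dense_bytes and coo_bytes are hypothetical helper names. With indices reaching 10**6 on each axis, a dense float64 array needs 8 TB, while storing about 2 * 10**7 (row, col, value) triplets needs roughly half a gigabyte.

```python
import numpy as np

def dense_bytes(xlen, ylen, dtype=np.float64):
    """Bytes needed by np.zeros((xlen, ylen), dtype=dtype)."""
    return xlen * ylen * np.dtype(dtype).itemsize

def coo_bytes(nnz, index_dtype=np.int64, value_dtype=np.float64):
    """Bytes needed to store nnz (row, col, value) triplets."""
    per_entry = 2 * np.dtype(index_dtype).itemsize + np.dtype(value_dtype).itemsize
    return nnz * per_entry

# Assumed sizes, for illustration only: indices up to 10**6 on each axis
# and 2 * 10**7 stored entries (dict keys plus array rows).
print(dense_bytes(10**6, 10**6) / 1e12, "TB as a dense array")
print(coo_bytes(2 * 10**7) / 1e9, "GB as coordinate triplets")
```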


Answer 1:


Use dok_matrix from scipy; a dok_matrix is a Dictionary Of Keys based sparse matrix. It lets you build a sparse matrix incrementally, and it does not allocate the huge empty darray = np.zeros((xlen, ylen)) that does not fit into your computer's memory.
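As a small illustration of building incrementally, the sketch below creates a dok_matrix with a deliberately large shape and touches only two cells. The shape is an assumption made for the example, and counts and as_dict are hypothetical names. Only the stored entries take up memory, and converting to COO gives the (row, col, value) triplets back.

```python
from scipy.sparse import dok_matrix

# Assumed shape for illustration; nothing of this size is allocated.
counts = dok_matrix((10**6, 10**6))

counts[0, 1] = 1    # store a value
counts[3, 7] = 1
counts[0, 1] += 1   # read, increment, and write back a single cell

# Convert to COO to iterate over the stored (row, col, value) triplets.
coo = counts.tocoo()
as_dict = {(int(r), int(c)): float(v)
           for r, c, v in zip(coo.row, coo.col, coo.data)}
print(counts.nnz, as_dict)
```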

The only changes needed are to import the right module from scipy and to change the definition of darray in your function convert_dict_to_darray.

It will look like this:

from scipy.sparse import dok_matrix

def convert_dict_to_darray(dict_with_tuples_key, myarray):
    idx_max_array = np.max(myarray, axis=0)
    idx_max_dict  = np.max(list(dict_with_tuples_key.keys()), axis=0)  # list() so this also works on Python 3
    lens = np.max([list(idx_max_array), list(idx_max_dict)], axis=0)
    xlen, ylen = lens[0] + 1, lens[1] + 1
    darray = dok_matrix((xlen, ylen))  # sparse: only assigned cells are stored
    for key, value in dict_with_tuples_key.items():
        darray[key[0], key[1]] = value
    return darray
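One caveat, as far as I know: numba's @njit does not accept scipy sparse matrices, so the compiled update_darray from the question is unlikely to work on a dok_matrix, and updating cell by cell in a plain Python loop may be slow for ~10e6 rows. A possible alternative, sketched below under that assumption, is to build the matrix directly in COOrdinate format and let scipy sum the duplicate coordinates. The function name update_dict_sparse is hypothetical and not part of the original answer.

```python
import numpy as np
from scipy.sparse import coo_matrix

def update_dict_sparse(dict_with_tuples_key, myarray):
    """Return a new dict with 1 added for every (row, col) pair in myarray."""
    # Existing dict entries as coordinate triplets.
    keys = np.array(list(dict_with_tuples_key.keys()), dtype=np.int64).reshape(-1, 2)
    vals = np.array(list(dict_with_tuples_key.values()), dtype=np.float64)
    # Each row of myarray contributes a 1 at its (row, col) position.
    rows = np.concatenate([keys[:, 0], myarray[:, 0]])
    cols = np.concatenate([keys[:, 1], myarray[:, 1]])
    data = np.concatenate([vals, np.ones(len(myarray))])
    shape = (int(rows.max()) + 1, int(cols.max()) + 1)
    # Converting COO to CSR sums entries that share the same (row, col).
    summed = coo_matrix((data, (rows, cols)), shape=shape).tocsr().tocoo()
    return {(int(r), int(c)): float(v)
            for r, c, v in zip(summed.row, summed.col, summed.data)}

myarray = np.array([[0, 0], [0, 1], [1, 1], [1, 2], [2, 2], [1, 3]])
dict_with_tuples_key = {(0, 1): 1, (3, 7): 1}
updated = update_dict_sparse(dict_with_tuples_key, myarray)
print(updated)
```

On the small example from the question this yields the same seven entries as the dense version, with (0, 1) counted twice, and memory use grows with the number of stored entries rather than with xlen * ylen.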


Source: https://stackoverflow.com/questions/35340440/2d-array-to-represent-a-huge-python-dict-coordinate-like-solution-to-save-memor
