How to scale operations with a massive dictionary of lists in Python?


Question


I'm dealing with a "big data" problem in Python, and I'm really struggling to find a scalable solution.

The data structure I currently have is a massive dictionary of lists, with millions of keys and lists with millions of items. I need to do an operation on the items in the list. The problem is two-fold:

(1) How to do scalable operations on a data structure this size?

(2) How to do this with constraints of memory?

For some code, here's a very basic example of a dictionary of lists:

example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846], 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}

For a simple operation on the elements, the normal approach would be to iterate through the dictionary:

new_dictionary = {}

for k, v in example_dict1.items():
    new_list = []
    for i in v:                    ## iterate through each list in the dictionary
        i = compute_something(i)   ## just an example; e.g. i**2 or i - 13 would also work
        new_list.append(i)
    # store the transformed list under the same key in the new dictionary
    new_dictionary[k] = new_list

The problem is that this approach cannot work for a dictionary of this size. I'm working on a server with over 250 GB of RAM, and the structure quickly becomes too large for memory. It's also too slow and doesn't scale, since it's a single iteration on one processor.

Are there scalable solutions to this problem?

Maybe it would work to somehow break up the dictionary, do the calculations with multiprocessing, and then aggregate the results (roughly sketched below)? Or is there a way to save this data to disk?

I'm at a loss for ideas...
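For what it's worth, this is roughly the direction I was imagining for the "split it up, multiprocess, aggregate" idea. It's only a minimal sketch with lots of assumptions: compute_something is just the placeholder operation from above, the chunk count and process count are arbitrary, and the tiny dictionary stands in for the real one:

from multiprocessing import Pool

def compute_something(i):
    return i ** 2   # placeholder per-item operation, same stand-in as above

def process_chunk(chunk):
    # chunk is a list of (key, list_of_values) pairs; transform each list
    return {k: [compute_something(i) for i in v] for k, v in chunk}

def split_items(d, n_chunks):
    # slice the dictionary's items into roughly equal chunks
    items = list(d.items())
    step = max(1, len(items) // n_chunks)
    return [items[i:i + step] for i in range(0, len(items), step)]

if __name__ == '__main__':
    example_dict1 = {'key1': [367, 30, 847], 'key2': [754, 915, 622]}  # tiny stand-in
    chunks = split_items(example_dict1, n_chunks=4)
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    new_dictionary = {}
    for part in partial_results:
        new_dictionary.update(part)   # aggregate the per-chunk results

And for the disk side of the question: would something like the standard-library shelve module be a reasonable way to avoid building the whole result in memory? Again just a sketch, with a made-up filename:

import shelve

# write each transformed list straight to a disk-backed mapping
with shelve.open('results_on_disk') as db:
    for k, v in example_dict1.items():
        db[k] = [compute_something(i) for i in v]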

Source: https://stackoverflow.com/questions/60811456/how-to-scale-operations-with-a-massive-dictionary-of-lists-in-python
