Question
I'm dealing with a "big data" problem in Python, and I'm struggling to find scalable solutions.
The data structure I currently have is a massive dictionary of lists, with millions of keys, each mapping to a list with millions of items. I need to do an operation on the items in each list. The problem is two-fold:
(1) How to do scalable operations on a data structure this size?
(2) How to do this with constraints of memory?
For some code, here's a very basic example of a dictionary of lists:
example_dict = {'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
                'key2': [754, 915, 622, 149, 279, 192, 312, 203, 742, 846],
                'key3': [586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}
For a simple operation on the elements, the normal approach would be to iterate through the dictionary:
new_dictionary = {}
for k, v in example_dict.items():
    new_list = []
    for i in v:  ## iterate through this key's list
        i = compute_something(i)  ## just an example, e.g. i**2 or i - 13 would also work
        new_list.append(i)
    # now, store the result in the new dictionary
    new_dictionary[k] = new_list
The problem is that this cannot work for a dictionary this size. I'm working on a server with over 250 GB of RAM, and it quickly becomes too large for memory. It's also too slow and not scalable, since it's a single loop running on one processor.
Are there scalable solutions to this problem?
Maybe it would work to somehow break up the dictionary, do the calculations with multiprocessing, and aggregate? Or is there a way to save this data to disk?
I'm at a loss for ideas...
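In case it helps clarify what I mean, here is a rough sketch of the multiprocessing-plus-disk idea I had in mind (compute_something is still just a placeholder, and the chunk size and the shelve file are assumptions on my part, not something I've tested at scale):

import multiprocessing as mp
import shelve

def compute_something(i):
    # placeholder for the real per-item operation
    return i ** 2

def process_chunk(chunk):
    # chunk is a list of (key, values) pairs small enough for one worker
    return [(k, [compute_something(i) for i in v]) for k, v in chunk]

def chunked(items, size):
    # split the (key, values) pairs into chunks of `size` keys each
    for start in range(0, len(items), size):
        yield items[start:start + size]

if __name__ == "__main__":
    example_dict = {'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
                    'key2': [754, 915, 622, 149, 279, 192, 312, 203, 742, 846],
                    'key3': [586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}

    chunks = list(chunked(list(example_dict.items()), 1000))

    # write results to a disk-backed store so the output never has to
    # sit in RAM all at once (shelve keys must be strings)
    with shelve.open('results.db') as out, mp.Pool() as pool:
        for result in pool.imap_unordered(process_chunk, chunks):
            for k, new_list in result:
                out[k] = new_list

Is something along these lines workable, or is there a more standard pattern for problems of this size?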
Source: https://stackoverflow.com/questions/60811456/how-to-scale-operations-with-a-massive-dictionary-of-lists-in-python