How to parallelize computation on “big data” dictionary of lists?

I have a question about doing calculations on a Python dictionary. In this case, the dictionary has millions of keys, and the lists are similarly long. There seems to be disagreement about whether parallelization could be used here, so I'll ask the question more explicitly. Here is the original question:

This is a toy (small) python dictionary:

example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846], 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}

Let's say I need to parse the values of the lists, which I've implemented into the following simple (toy) function:

def manipulate_values(input_list):
    return_values = []
    for i in input_list:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

Now, I can easily parse the values of this dictionary as follows:

for key, value in example_dict1.items():
    example_dict1[key] = manipulate_values(value)

resulting in the following:

example_dict1 = {'key1': [134676, 887, 717396, 232311, 786756, 427703, 120396, 254003, 170556, 674028], 
     'key2': [568503, 837212, 386871, 22188, 77828, 36851, 97331, 41196, 550551, 715703], 
     'key3': [343383, 271428, 220887, 226563, 480236, 181463, 556503, 537276, 278771, 319212]}

Question: Why couldn't I use multiple threads to do this calculation, e.g. three threads, one for key1, key2, and key3? Would concurrent.futures.ProcessPoolExecutor() work here?

Original question: Are there better ways to optimize this task to be quick?


Python threads will not really help you process this in parallel: because of the Global Interpreter Lock (GIL), only one thread executes Python bytecode at a time, so CPU-bound work effectively runs on one core. Threads are helpful for I/O-bound work, such as waiting on HTTP calls.
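As a rough illustration of the point about threads, here is a minimal sketch (the helper names `square_minus_13`, `run_serial` and `run_with_threads` are made up for this example). It runs the toy calculation once serially and once with a `ThreadPoolExecutor`. The results are identical, and for CPU-bound pure-Python work like this the threaded version is typically not faster, although exact timings depend on your machine and Python build.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def square_minus_13(values):
    # Same toy calculation as in the question.
    return [i ** 2 - 13 for i in values]

def run_serial(d):
    return {k: square_minus_13(v) for k, v in d.items()}

def run_with_threads(d, max_workers=3):
    # One task per key; the threads still take turns holding the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(zip(d.keys(), executor.map(square_minus_13, d.values())))

data = {f"key{n}": list(range(200_000)) for n in range(3)}

start = time.perf_counter()
serial_result = run_serial(data)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
threaded_result = run_with_threads(data)
threaded_seconds = time.perf_counter() - start

print(f"serial:   {serial_seconds:.3f} s")
print(f"threaded: {threaded_seconds:.3f} s")
```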

About ProcessPoolExecutor, from the docs:


The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.

It can help if you need heavy CPU processing. You can use:

import concurrent.futures

def manipulate_values(k_v):
    k, v = k_v
    return_values = []
    for i in v:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return k, return_values

with concurrent.futures.ProcessPoolExecutor() as executor:
    example_dict = dict(executor.map(manipulate_values, example_dict1.items()))

Here is a simple benchmark comparing a plain for loop with ProcessPoolExecutor. My scenario assumes that each item to be processed needs ~50 ms of CPU time:

You can see the real benefit of ProcessPoolExecutor when the CPU time per item is high.

from simple_benchmark import BenchmarkBuilder
import time
import concurrent.futures

b = BenchmarkBuilder()

def manipulate_values1(k_v):
    k, v = k_v
    time.sleep(0.05)  # stand-in for ~50 ms of work per item
    return k, v

def manipulate_values2(v):
    time.sleep(0.05)  # stand-in for ~50 ms of work per item
    return v

@b.add_function()
def test_with_process_pool_executor(d):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return dict(executor.map(manipulate_values1, d.items()))

@b.add_function()
def test_simple_for_loop(d):
    for key, value in d.items():
        d[key] = manipulate_values2((key, value))

@b.add_arguments('Number of keys in dict')
def argument_provider():
    for exp in range(2, 10):
        size = 2**exp
        yield size, {i: [i] * 10_000 for i in range(size)}

r = b.run()
r.plot()

If you do not set the number of workers for ProcessPoolExecutor, the default is the number of processors on your machine (for the benchmark I used a PC with 8 CPUs).
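If you want to control this yourself, a minimal sketch could look like the following. `max_workers` and the `chunksize` argument of `executor.map` are real parameters in `concurrent.futures`; the helper names `manipulate_item` and `parallel_apply` and the chosen values are only assumptions for illustration. A larger `chunksize` sends items to the workers in batches, which may reduce per-item overhead when there are many small items.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def manipulate_item(k_v):
    # Takes a (key, list) pair and returns (key, transformed list).
    k, v = k_v
    return k, [i ** 2 - 13 for i in v]

def parallel_apply(d, max_workers=None, chunksize=1):
    # max_workers=None lets the executor pick its default worker count.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(manipulate_item, d.items(), chunksize=chunksize))

if __name__ == "__main__":
    # Hypothetical workload: 1,000 keys with short lists.
    example = {f"key{n}": list(range(10)) for n in range(1_000)}
    workers = os.cpu_count() or 1
    result = parallel_apply(example, max_workers=workers, chunksize=100)
    print(workers, "workers,", len(result), "keys processed")
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the main module.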

But in your case, with the data provided in your question, processing one item takes only ~2 to 3 µs:

%timeit manipulate_values([367, 30, 847, 482, 887, 654, 347, 504, 413, 821])
2.32 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In that case the benchmark favours the simple for loop, so it is better to use a simple for loop if the CPU time per item is low.
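Outside IPython, the per-item cost can be estimated with the standard `timeit` module before deciding whether a process pool is worth it. This is only a sketch: the function name `per_call_seconds` is hypothetical, and the measured values depend on your machine.

```python
import timeit

def manipulate_values(input_list):
    return [i ** 2 - 13 for i in input_list]

def per_call_seconds(func, arg, number=10_000, repeat=5):
    # Best-of-`repeat` average time of a single call, in seconds.
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number

sample = [367, 30, 847, 482, 887, 654, 347, 504, 413, 821]
cost = per_call_seconds(manipulate_values, sample)
print(f"{cost * 1e6:.2f} µs per call")
```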

A good point raised by @user3666197 is the case where you have huge items/lists. I benchmarked both approaches using 1_000_000 random numbers in each list:

As you can see, in this case it is more suitable to use ProcessPoolExecutor.

from simple_benchmark import BenchmarkBuilder
import time
import concurrent.futures
from random import choice

b = BenchmarkBuilder()

def manipulate_values1(k_v):
    k, v = k_v
    return_values = []
    for i in v:
        new_value = i ** 2 - 13
        return_values.append(new_value)

    return k, return_values

def manipulate_values2(v):
    return_values = []
    for i in v:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

@b.add_function()
def test_with_process_pool_executor(d):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return dict(executor.map(manipulate_values1, d.items()))

@b.add_function()
def test_simple_for_loop(d):
    for key, value in d.items():
        d[key] = manipulate_values2(value)

@b.add_arguments('Number of keys in dict')
def argument_provider():
    for exp in range(2, 5):
        size = 2**exp
        yield size, {i: [choice(range(1000)) for _ in range(1_000_000)] for i in range(size)}

r = b.run()
r.plot()

This is expected, since processing one item takes ~209 ms:

l = [367] * 1_000_000
%timeit manipulate_values2(l)
# 209 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Still, the fastest option is to use numpy arrays with the simple for loop solution:

from simple_benchmark import BenchmarkBuilder
import time
import concurrent.futures
import numpy as np

b = BenchmarkBuilder()

def manipulate_values1(k_v):
    k, v = k_v
    return k, v ** 2 - 13

def manipulate_values2(v):
    return v ** 2 - 13

@b.add_function()
def test_with_process_pool_executor(d):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return dict(executor.map(manipulate_values1, d.items()))

@b.add_function()
def test_simple_for_loop(d):
    for key, value in d.items():
        d[key] = manipulate_values2(value)

@b.add_arguments('Number of keys in dict')
def argument_provider():
    for exp in range(2, 7):
        size = 2**exp
        yield size, {i: np.random.randint(0, 1000, size=1_000_000) for i in range(size)}

r = b.run()
r.plot()

It is expected that the simple for loop is faster here, since processing one numpy array takes < 1 ms:

def manipulate_values2(input_list):
    return input_list ** 2 - 13

l = np.random.randint(0, 1000, size=1_000_000)
%timeit manipulate_values2(l)
# 951 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
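Applied to the toy dictionary from the question, the numpy approach could be sketched as follows. It assumes numpy is installed and that the lists can be converted to integer arrays; the explicit `int64` dtype is my own choice, to avoid overflow for larger values.

```python
import numpy as np

example_dict1 = {
    'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
    'key2': [754, 915, 622, 149, 279, 192, 312, 203, 742, 846],
    'key3': [586, 521, 470, 476, 693, 426, 746, 733, 528, 565],
}

# Convert each list once, then let numpy do the arithmetic element-wise.
arrays = {k: np.asarray(v, dtype=np.int64) for k, v in example_dict1.items()}
result = {k: a ** 2 - 13 for k, a in arrays.items()}

print(result['key1'][:3])
```

Converting the lists costs time too, so this pays off most when the data can be kept as arrays from the start.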


Q : "Why couldn't I use multiple threads to do this calculation, e.g. three threads, one for key1, key2, and key3?"

You could, yet with no reasonable effect on performance. Knowing how python handles the thread-based flow of execution is cardinal here. Learn about the GIL (Global Interpreter Lock), which is used precisely to avoid any concurrent processing of python code, and about its effects on performance, and you get the WHY-part.

Q : "Would concurrent.futures.ProcessPoolExecutor() work here?"


It would. Yet the net-effect thereof ( whether it is any "faster" than a pure-[SERIAL] flow of processing ) will depend on the size of the "large" lists ( described above as (cit.) "millions of keys, and the lists are similarly long." ) that have to get copied ( RAM-I/O ) and passed ( SER/DES-processed + IPC-transferred ) to the pool of spawned ( process-based ) remote executors.

These many times repeated RAM-I/O + SER/DES add-on overhead costs will soon dominate.

A RAM-I/O copy step:

>>> from zmq import Stopwatch; aClk = Stopwatch()

>>> aClk.start(); aList = [ i for i in range( int( 1E4 ) ) ]; aClk.stop()
   1345 [us] to copy a List of 1E4 elements
>>> aClk.start(); aList = [ i for i in range( int( 1E5 ) ) ]; aClk.stop()
  12776 [us] to copy a List of 1E5 elements
>>> aClk.start(); aList = [ i for i in range( int( 1E6 ) ) ]; aClk.stop()
 149197 [us] to copy a List of 1E6 elements
>>> aClk.start(); aList = [ i for i in range( int( 1E7 ) ) ]; aClk.stop()
1253792 [us] to copy a List of 1E7 elements
|  |::: [us]
|  +--- [ms]
+------ [ s]

SER/DES step :

>>> import pickle
>>> aClk.start(); _ = pickle.dumps( aList ); aClk.stop()
 638821 [us] to pickle.dumps() a List of 1E7 elements
|  |::: [us]
|  +--- [ms]
+------ [ s]

So the expected per-batch add-on overhead is ~ 2 x ( 1253 + 638 ) [ms] + IPC-transfer costs for just one shot of 1E7 items.
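If `zmq` is not available, a comparable measurement can be sketched with the standard library only. The helper name `time_pickle_round_trip` is made up for this example, and the numbers will differ from the ones above depending on hardware.

```python
import pickle
import time

def time_pickle_round_trip(obj):
    # Returns (seconds to serialise, seconds to deserialise, restored object).
    start = time.perf_counter()
    payload = pickle.dumps(obj)
    dump_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = pickle.loads(payload)
    load_seconds = time.perf_counter() - start
    return dump_seconds, load_seconds, restored

a_list = list(range(1_000_000))
dump_s, load_s, restored = time_pickle_round_trip(a_list)
print(f"dumps: {dump_s * 1e3:.1f} ms, loads: {load_s * 1e3:.1f} ms")
```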

The actual useful-work payload of manipulate_values() is so small that it can hardly justify the lump sum of all the add-on costs associated with distributing the work-units across the pool of remote workers. Much smarter results are to be expected from vectorised forms of computing. The add-on costs here are far larger than the small amount of useful work.

The net effect will depend all the more on the overhead costs of SER/DES parameter passing "there", plus the add-on costs of SER/DES on the results being returned "back". Together these decide the outcome: anti-speedups ( << 1.0 x ) are quite often observed in use-cases built on poor design-side engineering practices, and no late benchmark can salvage the man*days already burnt on such a poor design decision.
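The trade-off can be put into a deliberately simplified model. The function below is an illustrative assumption, not a measurement: it treats the add-on overhead as a fixed cost added to perfectly divided work, and all numbers in it are hypothetical.

```python
def estimated_speedup(serial_seconds, overhead_seconds, workers):
    # Parallel time = evenly split work + fixed add-on overhead.
    parallel_seconds = serial_seconds / workers + overhead_seconds
    return serial_seconds / parallel_seconds

# Hypothetical numbers: 0.2 s of useful work, 3.8 s of SER/DES + copy overhead.
small_payload = estimated_speedup(0.2, 3.8, workers=8)
# Hypothetical numbers: 400 s of useful work with the same overhead.
large_payload = estimated_speedup(400.0, 3.8, workers=8)

print(f"small payload: {small_payload:.2f}x, large payload: {large_payload:.2f}x")
```

In this model the small payload ends up well below 1.0 x (an anti-speedup), while the large payload approaches the worker count.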

