Efficient parallelization of operations on two-dimensional arrays in Python


Question


I'm trying to parallelize operations on a two-dimensional array using the joblib library in Python. Here is the code I have:

from joblib import Parallel, delayed
import multiprocessing
import numpy as np

# The code below just aggregates the base_array to form a new two dimensional array
base_array = np.ones((2**12, 2**12), dtype=np.uint8)
def compute_average(i, j):
    return np.uint8(np.mean(base_array[i*4: (i+1)*4, j*4: (j+1)*4]))

num_cores = multiprocessing.cpu_count()
new_array = np.array(Parallel(n_jobs=num_cores)(delayed(compute_average)(i, j) 
                                        for i in xrange(0,1024) for j in xrange(0,1024)), dtype=np.uint8)

The above code takes more time than the basic nested for loop below.

new_array_nested = np.ones((2**10, 2**10), dtype=np.uint8)
for i in xrange(0,1024):
    for j in xrange(0,1024):
         new_array_nested[i,j] = compute_average(i,j)

Why are parallel operations taking more time? How can the efficiency of the above code be improved?


Answer 1:


(Comment from the asker: "Wow! Absolutely loved your code. It worked like a charm, improving the total efficiency by 400x. I'll try to read more about numba and jit compilers, but could you write briefly about why it is so efficient? Thanks once again for all the help!" – Ram Jan 3 '18 at 20:30)

We can quite easily get somewhere under 77 [ms], but it takes mastering a few steps to get there, so let's start:


Q: Why are parallel operations taking more time?

Because the proposed joblib step spawns that many full-scale copies of the whole process, so as to escape the GIL-stepped, pure-[SERIAL] (one-after-another) execution, but (!) this includes the add-on costs of all the memory transfers (very expensive for indeed large numpy arrays) of all variables and of the whole Python interpreter with its internal state, before it ever gets to the first step of the "useful" work on your "payload"-calculation strategy,
so
the sum of all these instantiation overheads can easily become larger than what an overhead-agnostic expectation of an inversely proportional 1 / N speedup would suggest,
where you set N ~ num_cores.

For details, read the mathematical formulation in the tail part of the overhead-strict re-formulation of Amdahl's Law.
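For orientation, one overhead-strict form of the speedup reads (the notation here is my own shorthand, since the answer only refers to the full derivation: s is the serial fraction of the original run, N the number of workers, pSO and pTO the parallel setup- and termination-overheads expressed as fractions of the original single-process runtime):

    S = 1 / ( s + ( 1 - s ) / N + pSO + pTO )

Once pSO + pTO grow comparable to ( 1 - s ) / N, adding more cores stops helping, and S can even drop below 1, which is exactly what the joblib attempt in the question exhibits.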


Q: How can the efficiency of the above code be improved?

Save as much as possible on all overhead costs, wherever possible:
- on the process spawn-side, try to use n_jobs = ( num_cores - 1 ), to leave some headroom for the "main" process, and benchmark whether performance goes up;
- on the process termination-side, avoid collecting-and-constructing a new (possibly large) object from the returned values; rather, pre-allocate just-large-enough, process-local data structures and return something compact and easily serialised, so that the per-partes returned results can be coalesced cheaply and without blocking (a sketch of both ideas follows the next paragraph).

Both of these "hidden" costs are your main design enemies, as they get linearly added to the pure-[SERIAL] part of the computing path of the whole problem solution (ref.: the effects of both of these terms in the overhead-strict Amdahl's Law formula above).
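A minimal sketch of the two ideas above, written for Python 3 (my own illustration, not the answer's measured code, and keeping the question's module-level base_array): each spawned job processes one whole strip of 4x4 tiles into a pre-allocated, process-local buffer and returns a single compact numpy vector, so the per-call spawn/return overhead is paid 1024 times instead of 2**20 times and the terminal coalescing is one cheap np.vstack():

from joblib import Parallel, delayed
import multiprocessing
import numpy as np

base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )

def compute_average_strip( i ):                      # one job == one full row of 4x4 tiles
    out   = np.empty( 1024, dtype = np.uint8 )       # pre-allocated, process-local result
    block = base_array[ i*4:(i+1)*4, : ]
    for j in range( 1024 ):
        out[j] = np.uint8( block[ :, j*4:(j+1)*4 ].mean() )
    return out                                       # one compact array, cheap to serialise

num_cores = multiprocessing.cpu_count()
rows      = Parallel( n_jobs = max( 1, num_cores - 1 ) )(   # leave headroom for the "main" process
                delayed( compute_average_strip )( i ) for i in range( 1024 ) )
new_array = np.vstack( rows )                        # single, cheap terminal coalescing step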


Experiments & Results:

>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )
>>> base_array.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA      : True
  WRITEABLE    : True
  ALIGNED      : True
  UPDATEIFCOPY : False
>>> def compute_average_per_TILE(               TILE_i,   TILE_j ): # NAIVE MODE
...     return np.uint8( np.mean( base_array[ 4*TILE_i:4*(TILE_i+1),
...                                           4*TILE_j:4*(TILE_j+1)
...                                           ]
...                               )
...                      )
... 
>>> aClk.start(); _ = compute_average_per_TILE( 12,13 ); aClk.stop()
25110
  102
  109
   93

This takes about 93 [us] per one shot, so one can expect about 1024 * 1024 * 93 ~ 97,517,568 [us] to cover the mean-processing over the whole base_array this way.

Experimentally, one can nicely see here the impact of overheads that are not handled well; the naive, nested list-comprehension experiment took:

>>> aClk.start(); _ = [ compute_average_per_TILE( i, j )
                                              for i    in xrange(1024)
                                              for    j in xrange(1024)
                        ]; aClk.stop()
26310594
26310594 / 1024. / 1024. == 25.09 [us/cell]

which is about 3.7x less per cell (due to not incurring the "tail"-part overhead, the assignment of individual returned values, 2**20 times, but only once, at the terminal assignment).

Yet, more surprises to come.


What is a proper tool here?

There is never a universal rule, no one-size-fits-all.

Given that not more than a 4x4 matrix tile is going to be processed per call (each taking actually less than 25 [us]), the original proposal orchestrated 2**20 such joblib-spawned calls, distributed over ~ .cpu_count() fully instantiated processes:

joblib.Parallel( n_jobs = num_cores )(
     joblib.delayed( compute_average )( i, j )
                                    for i in xrange( 1024 )
                                    for j in xrange( 1024 )
     )

there is indeed a space to improve the performance.

For these small-scale matrices (not all problems are this lucky in that sense), one can expect the best results from smarter memory-access patterns and from reducing the python GIL-originated weaknesses.

As the per-call span is just a 4x4, micro-sized computation, a far better way is to harness smart vectorisation (all data fit into cache, so in-cache computing is a holiday journey for hunting the utmost performance).
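For illustration, a one-line, fully vectorised numpy alternative (my own sketch, not the code that produced the timings quoted below) computes all 4x4 tile means at once by folding the tile axes into the shape and averaging over them:

tile_means = base_array.reshape( 1024, 4, 1024, 4 ).mean( axis = (1, 3) ).astype( np.uint8 )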

The best (still very naively vectorised) code was able to get from ~ 25 [us/cell] down to less than ~ 74 [ns/cell] (still leaving space for better-aligned processing, as it took ~ 4.6 [ns] per base_array cell processed), so expect yet another level of speedup once the in-cache-optimised vectorised code gets crafted properly.

In 77 [ms] ?! Worth doing that right, isn't it?

Not 97 seconds,
not 25 seconds,
but less than 77 [ms] in just a few strokes of the keyboard, and more could have been squeezed off by better optimising the call-signature:

>>> import numba
>>> @numba.jit( nogil = True, nopython = True )
... def jit_avg2( base_IN, ret_OUT ):  # all pre-allocated memory for these data-structures
...     for i in np.arange( 1024 ):    # vectorised-code ready numpy iterator
...         for j in np.arange( 1024 ):# vectorised-code ready numpy iterator
...             ret_OUT[i,j] = np.uint8( np.mean( base_IN[4*i:4*(i+1),
...                                                       4*j:4*(j+1)
...                                                       ]
...                                               )
...                                      )
...     return                         # avoid terminal assignment costs
... 

>>> mean_array = np.zeros( (2**10, 2**10), dtype = np.uint8 ) # pre-allocated result array, filled in-place
>>> aClk.start(); _ = jit_avg2( base_array, mean_array ); aClk.stop()
1586182 (even with all the jit-compilation circus, it was FASTER than GIL-stepped nested fors ...)
  76935
  77337
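As one illustration of the call-signature optimisation hinted at above (my assumption of what could be squeezed further, not anything measured in the answer), numba also accepts an explicit, eager signature, which moves the type-inference and compilation work to decoration time and out of the timed call:

@numba.jit( "void( uint8[:,:], uint8[:,:] )", nogil = True, nopython = True )
def jit_avg2_sig( base_IN, ret_OUT ):              # compiled eagerly, at decoration time
    for i in range( 1024 ):
        for j in range( 1024 ):
            ret_OUT[i,j] = np.uint8( np.mean( base_IN[ 4*i:4*(i+1), 4*j:4*(j+1) ] ) )
    return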


Source: https://stackoverflow.com/questions/48068584/efficient-parallelization-of-operations-on-two-dimensional-array-operations-in-p
