Optimize Double loop in python

Submitted by 倾然丶 夕夏残阳落幕 on 2020-01-02 02:04:30

Question


I am trying to optimize the following loop:

def numpy(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):
            a[ix, iz]  = sum(c*rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum(c*rho[ix-2:ix+2, iz])
    return a, b
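Each inner assignment is just a dot product of the four coefficients in c with a vertical window of rho; for example (a standalone illustration, not code from the thread):

```python
import numpy as np

rng = np.random.default_rng(42)
rho = rng.random((8, 5))
c = rng.random(4)

ix, iz = 3, 2
window = rho[ix-1:ix+3, iz]  # four consecutive rows of a single column
# The loop body sum(c * window) is exactly np.dot(c, window)
assert np.isclose(sum(c * window), np.dot(c, window))
```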

I tried different solutions and found that using numba to compute the sum of products gives better performance:

import numpy as np
import numba as nb
import time

@nb.autojit
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

def numba1(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.autojit
def numba2(nx, nz, c, rho):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b
nx = 1024
nz = 256    

rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.perf_counter()
a, b = numpy(nx, nz, c, rho)
print('Time numpy  : ' + str(round(time.perf_counter() - ti, 4)))

ti = time.perf_counter()
a, b = numba1(nx, nz, c, rho)
print('Time numba1 : ' + str(round(time.perf_counter() - ti, 4)))

ti = time.perf_counter()
a, b = numba2(nx, nz, c, rho)
print('Time numba2 : ' + str(round(time.perf_counter() - ti, 4)))

This leads to:

Time numpy : 4.1595

Time numba1 : 0.6993

Time numba2 : 1.0135

Using the numba version of the sum function (sum_opt) performs very well. But I am wondering why the numba version of the double loop function (numba2) leads to slower execution times. I tried to use jit instead of autojit, specifying the argument types, but it was worse.

I also noticed that looping first over the smallest dimension is slower than looping first over the biggest one. Is there any explanation?

Be that as it may, I am sure this double-loop function can be improved a lot by vectorizing the problem (like this) or by using another method (map?), but I am a little confused about these methods.

In the other parts of my code, I used numba and numpy slicing methods to replace all explicit loops, but in this particular case I don't know how to set it up.

Any ideas ?
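For reference, one plain-NumPy way to vectorize such a double loop (a sketch added here for illustration, not code from the thread) is to accumulate one shifted slice of rho per kernel coefficient, so only the length-4 kernel is looped over in Python:

```python
import numpy as np

def vectorized(nx, nz, c, rho):
    a = np.zeros((nx, nz))
    b = np.zeros((nx, nz))
    # One shifted slice per kernel tap replaces the explicit ix/iz loops:
    # a[ix, iz] = sum_j c[j] * rho[ix-1+j, iz], and similarly for b.
    for j in range(len(c)):
        a[2:nx-3, 2:nz-3] += c[j] * rho[1+j:nx-4+j, 2:nz-3]
        b[2:nx-3, 2:nz-3] += c[j] * rho[j:nx-5+j, 2:nz-3]
    return a, b
```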

EDIT

Thanks for all your comments. I worked a little on this problem:

import numba as nb
import numpy as np
from scipy import signal
import time


@nb.jit(['float64(float64[:], float64[:])'], nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

@nb.autojit
def numba1(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b


@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3a(nx, nz, c, rho, a):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
    return a

@nb.jit(['float64[:,:](int16, int16, float64[:], float64[:,:], float64[:,:])'], nopython=True)
def numba3b(nx, nz, c, rho, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return b

def convol(nx, nz, c, aa, bb):
    s1 = rho[1:nx-1,2:nz-3]
    s2 = rho[0:nx-2,2:nz-3]
    kernel = c[:,None][::-1]
    aa[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
    bb[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')
    return aa, bb


nx = 1024
nz = 256 
rho = np.random.rand(nx, nz)
c = np.random.rand(4)
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

ti = time.perf_counter()
for i in range(1000):
    a, b = numba1(nx, nz, c, rho, a, b)
print('Time numba1 : ' + str(round(time.perf_counter() - ti, 4)))

ti = time.perf_counter()
for i in range(1000):
    a, b = numba2(nx, nz, c, rho, a, b)
print('Time numba2 : ' + str(round(time.perf_counter() - ti, 4)))

ti = time.perf_counter()
for i in range(1000):
    a = numba3a(nx, nz, c, rho, a)
    b = numba3b(nx, nz, c, rho, b)
print('Time numba3 : ' + str(round(time.perf_counter() - ti, 4)))

ti = time.perf_counter()
for i in range(1000):
    a, b = convol(nx, nz, c, a, b)
print('Time convol : ' + str(round(time.perf_counter() - ti, 4)))

Your solution is very elegant, Divakar, but I have to call this function a large number of times in my code. So, for 1000 iterations, this leads to:

Time numba1 : 3.2487

Time numba2 : 3.7012

Time numba3 : 3.2088

Time convol : 22.7696

autojit and jit are very close. However, when using jit, it seems important to specify all argument types.

I do not know if there is a way to specify the argument types in the jit decorator when the function has multiple outputs. Does anyone?

For now I have not found any solution other than using numba. New ideas are welcome!


Answer 1:


Numba is very fast in nopython mode but with your code it has to fall back to object mode, which is a lot slower. You can see this happening if you pass nopython=True to the jit decorator.

It does compile in nopython mode (at least in Numba version 0.18.2) if you pass a and b as arguments:

import numba as nb

@nb.jit(nopython=True)
def sum_opt(arr1, arr2):
    s = arr1[0]*arr2[0]
    for i in range(1, len(arr1)):
        s+=arr1[i]*arr2[i]
    return s

@nb.jit(nopython=True)
def numba2(nx, nz, c, rho, a, b):
    for ix in range(2, nx-3):
        for iz in range(2, nz-3):        
            a[ix, iz]  = sum_opt(c, rho[ix-1:ix+3, iz])
            b[ix, iz]  = sum_opt(c, rho[ix-2:ix+2, iz])
    return a, b

Note that in the release notes there is mention of autojit being deprecated in favor of jit.


Apparently you're not satisfied yet. So how about a solution based on stride_tricks?

from numpy.lib.stride_tricks import as_strided

def stridetrick_einsum(c, rho, out):
    ws = len(c)
    nx, nz = rho.shape

    shape = (nx-ws+1, ws, nz)
    strides = (rho.strides[0],) + rho.strides
    rho_windowed = as_strided(rho, shape, strides)

    np.einsum('j,ijk->ik', c, rho_windowed, out=out)

a = np.zeros((nx, nz))
stridetrick_einsum(c, rho[1:-1,2:-3], a[2:-3,2:-3])
b = np.zeros((nx, nz))
stridetrick_einsum(c, rho[0:-2,2:-3], b[2:-3,2:-3])
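as_strided works, but it is easy to get wrong; if you are on NumPy 1.20 or later, the same windowing can be written more safely with sliding_window_view (a sketch under that version assumption, not part of the original answer):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windowed_einsum(c, rho, out):
    # Windows of len(c) rows; the window axis is appended last,
    # giving shape (nx - len(c) + 1, nz, len(c)).
    win = sliding_window_view(rho, len(c), axis=0)
    np.einsum('j,ikj->ik', c, win, out=out)
```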

What's more, since a and b are obviously almost exactly the same, you can calculate them in one go and then copy the values over:

a = np.zeros((nx, nz))
stridetrick_einsum(c, rho[:-1,2:-3], a[1:-3,2:-3])
b = np.zeros((nx, nz))
b[2:-3,2:-3] = a[1:-4,2:-3]
a[1,2:-3] = 0.0



Answer 2:


You are basically performing 2D convolution there, with the small difference that your kernel is not reversed, as it would be in the usual convolution operation. So, basically, there are two things we need to do here to use signal.convolve2d for our case -

  • Slice the input array rho to select a portion of it which is used in the original loopy version of your code. This would be the input data to convolution.
  • Reverse the kernel c and feed it along with the sliced data to signal.convolve2d.

Please note that this has to be done separately for the calculation of both a and b.
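The kernel-reversal point is easy to sanity-check in 1D with plain NumPy (a standalone check, not part of the original answer): convolving with the flipped kernel reproduces the sliding dot products computed by the loops.

```python
import numpy as np

x = np.arange(8, dtype=float)
c = np.array([0.5, 1.0, 2.0, 4.0])

# Convolution flips the kernel internally, so feeding it pre-flipped
# recovers the plain sliding dot product (i.e. correlation).
conv = np.convolve(x, c[::-1], mode='valid')
corr = np.array([np.dot(c, x[i:i+4]) for i in range(len(x) - 3)])
assert np.allclose(conv, corr)
```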

Here's the implementation -

import numpy as np
from scipy import signal

# Slices for convolutions to get a and b respectively        
s1 = rho[1:nx-1,2:nz-3]
s2 = rho[0:nx-2,2:nz-3]
kernel = c[:,None][::-1]  # convolution kernel

# Setup output arrays and fill them with convolution results
a = np.zeros((nx, nz))
b = np.zeros((nx, nz))

a[2:nx-3,2:nz-3] = signal.convolve2d(s1, kernel, boundary='symm', mode='valid')
b[2:nx-3,2:nz-3] = signal.convolve2d(s2, kernel, boundary='symm', mode='valid')

If you don't need the extra zeros around the boundaries of the output arrays, you could simply use the outputs from signal.convolve2d as they are, which should further boost performance.

Runtime tests

In [532]: %timeit loop_based(nx, nz, c, rho)
1 loops, best of 3: 1.52 s per loop

In [533]: %timeit numba1(nx, nz, c, rho)
1 loops, best of 3: 282 ms per loop

In [534]: %timeit numba2(nx, nz, c, rho)
1 loops, best of 3: 509 ms per loop

In [535]: %timeit conv_based(nx, nz, c, rho)
10 loops, best of 3: 15.5 ms per loop

So, for the actual input data size, the proposed convolution-based approach is about 100x faster than the loopy code and about 20x faster than the fastest numba-based approach, numba1.




Answer 3:


You are not taking full advantage of numpy's capabilities. The numpythonic way of handling your problem would be something like:

cs = np.zeros((nx+1, nz))
np.cumsum(c*rho, axis=0, out=cs[1:])
aa = cs[5:, 2:-3] - cs[1:-4, 2:-3]
bb = cs[4:-1, 2:-3] - cs[:-5, 2:-3]

aa will now hold the central, non-zero part of your a array:

>>> a[:5, :5]
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  2.31296595,  2.15743042,  2.5853117 ],
       [ 0.        ,  0.        ,  2.02697233,  2.83191016,  2.58819583],
       [ 0.        ,  0.        ,  2.4086584 ,  2.45175615,  2.19628507]])
>>> aa[:3, :3]
array([[ 2.31296595,  2.15743042,  2.5853117 ],
       [ 2.02697233,  2.83191016,  2.58819583],
       [ 2.4086584 ,  2.45175615,  2.19628507]])

and similarly for bb and b.

On my system, with your sample input, this code runs over 300x faster than your numpy function. According to your timings, that is going to be one or two orders of magnitude faster than numba.




Answer 4:


As stated in the performance section of Continuum's blog, autojit compiles just-in-time, while jit compiles ahead of time:

Numba can compile just-in-time with the autojit decorator, or ahead of time with the jit decorator

This means that in many cases autojit lets the compiler make a more educated guess about the code it compiles, and optimize accordingly. I know "just-in-time compilation ahead of time" sounds contradictory, but hey.

But I am wondering why the numba version of the double loop function (numba2) leads to slower execution times

Numba doesn't increase the performance of arbitrary function calls. While I can't say for sure, my guess is that the overhead of the JIT compilation outweighs the benefit from doing it (if there's any benefit at all).

I also noticed that looping first on the smallest loop is slower than looping first on the biggest loop. Is there any explanation ?

This is likely due to a cache miss. A 2-dimensional array is allocated as a contiguous chunk of memory of size rows * columns. What is fetched to cache is based on a combination of temporal (what has recently been used) and spatial (what is close in memory to what has been used) locality, i.e. what is believed to be used next.

When iterating over rows first, you iterate in the order the data appears in memory. When iterating over columns first, you "skip" the width of a row in memory each time, making it less likely that the memory location you're accessing has been fetched to cache.

2D array: [[1,2,3],[4,5,6],[7,8,9]]
In memory:  1 2 3   4 5 6   7 8 9

Let's assume an over-simplified, stupid cache fetch algorithm which fetches 3 subsequent memory locations.

Iterating row-first:

In memory:    1    2    3  |  4    5    6  |  7    8    9
Accessed:     1    2    3  |  4    5    6  |  7    8    9
Cache miss:   -    -    -  |  -    -    -  |  -    -    -

Iterating column-first:

In memory:    1    2    3  |  4    5    6  |  7    8    9
Accessed:     1    4    7  |  2    5    8  |  3    6    9
Cache miss:   -    -    -  |  x    x    x  |  x    x    x
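The row-major layout the diagram assumes can be verified directly from an array's strides (a standalone NumPy check, assuming the default C order and an 8-byte float64 itemsize):

```python
import numpy as np

# A C-ordered (row-major) array stores each row contiguously, as in the diagram.
x = np.arange(9, dtype=np.float64).reshape(3, 3)
print(x.strides)  # (24, 8): stepping down a column jumps 24 bytes, along a row only 8
print(x.ravel())  # rows laid out back to back in memory
```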


来源:https://stackoverflow.com/questions/30211336/optimize-double-loop-in-python
