A long term puzzle, how to optimize multi-level loops in python?

前端 未结 2 1359
终归单人心
终归单人心 2020-12-19 21:19

I have written a function in python to calculate Delta function in Gauss broadening, which involves 4-level loops. However, the efficiency is very low, about 10 times slower

2条回答
  •  独厮守ぢ
    2020-12-19 21:46

    Simply compile it

    To get the best performance I recommend Numba (easy usage, good performance). Alternatively Cython may be a good idea, but with a bit more changes to your code.

    You actually got everything right and implemented a easy to understand (for a human and most important for a compiler) solution.

    There are basically two ways to gain performance

    1. Vectorize the code as @scnerd showed. This is usually a bit slower and more complex than simply compile a quite simple code, that only uses some for loops. Don't vectorize your code and than use a compiler. From a simple looping aproach this is usually some work to do and leads to a slower and more complex result. The advantage of this process is that you only need numpy, which is a standard dependency in nearly every Python project that deals with some numerical calculations.

    2. Compile the code. If you have already a solution with a few loops and no other, or only a few non numpy functions involved this is often the simplest and fastest solution.

    A solution using Numba

    You do not have to change much, I changed the pow function to np.power and some slight changes to the way arrays accessed in numpy (this isn't really necessary).

    import numba as nb
    import numpy as np
    
    #performance-debug info
    import llvmlite.binding as llvm
    llvm.set_option('', '--debug-only=loop-vectorize')
    
    @nb.njit(fastmath=True)
    def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv):
        Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
        for w1 in range(Nw):
            for k1 in range(N_kp):
                for i1 in range(N_bd):
                    for j1 in range(N_bd):
                        if ( j1 >= i1 ):
                            Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
        return Delta_Gauss
    

    Due to the 'if' the SIMD-vectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss) will be necessary). I also parallelized the function.

    @nb.njit(fastmath=True,parallel=True)
    def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv):
        Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64)
        for w1 in nb.prange(Nw):
            for k1 in range(N_kp):
                for i1 in range(N_bd):
                    for j1 in range(N_bd):
                        Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
        return Delta_Gauss
    

    Performance

    Nw = 20
    N_bd = 20
    N_kp = 20
    width=20
    hw = np.linspace(0., 1.0, Nw) 
    eigv = np.zeros((N_kp, N_bd),dtype=np.float) 
    
    Your version:           0.5s
    first_compiled version: 1.37ms
    parallel version:       0.55ms
    

    These easy optimizations lead to about 1000x speedup.

提交回复
热议问题