A long-term puzzle: how to optimize multi-level loops in Python?

终归单人心 2020-12-19 21:19

I have written a function in Python to calculate the Delta function with Gaussian broadening, which involves 4-level loops. However, the efficiency is very low: it runs about 10 times slower than my equivalent Fortran code.

2 answers
  •  刺人心 (OP)
     2020-12-19 21:54

    BLUF: Using Numpy's full functionality, plus another neat module, you can make the Python version run more than 100x faster than this raw for-loop code. Using @max9111's answer, however, you can get even faster with much cleaner code and less work.

    The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and the final code make sense. Essentially, we're going to use a lot of broadcasting so that Numpy performs the looping under the hood (which is always faster than looping in Python). The code computes the full square matrix of results, which means we're doing some work that the original code's inner if skipped, but it's easier, and honestly probably faster, to do that extra work in high-performance ways than to have an if at the deepest level of looping to avoid it. Avoiding it might be worthwhile in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production; it's not necessary).
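
    For reference, np.triu just zeroes out everything below the main diagonal, and for arrays with more than two dimensions it operates on the last two axes, which is why it can be applied directly to the (W, N, B, B) result. A tiny illustration (my own toy example, not part of the original code):

    import numpy as np

    a = np.arange(9).reshape(3, 3)
    print(np.triu(a))
    # [[0 1 2]
    #  [0 4 5]
    #  [0 0 8]]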

    First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:

    def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
        # The denominator never changes across iterations, so compute it once.
        denom = np.sqrt(2.0 * np.pi) * width
        # np.matrix keeps each row eigv[k1] two-dimensional (shape (1, N_bd)),
        # so row minus row-transposed broadcasts to an (N_bd, N_bd) matrix.
        eigv = np.matrix(eigv)
        for w1 in range(Nw):
            for k1 in range(N_kp):
                this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
                v = np.power(this_eigv / width, 2)
                Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    
        # Take the upper triangle to make the result exactly equal to the original code
        return np.triu(Delta_Gauss)
    

    Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops can be removed in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently subtracting matrices of shapes (1, B) - (B, 1) for each of the N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):

    def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
        denom = np.sqrt(2.0 * np.pi) * width
        for w1 in range(Nw):
            this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1) and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):

    def Delta_Gaussf(hw, width, eigv):
        # Pairwise differences for every k-point at once: shape (N_kp, N_bd, N_bd)
        eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
        # Broadcast (1, N, B, B) - (W, 1, 1, 1) -> (W, N, B, B)
        w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
        v = np.power(w_sub / width, 2)
        denom = np.sqrt(2.0 * np.pi) * width
        Delta_Gauss = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
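
    (If you want to reproduce the comparison yourself, a minimal timing harness along these lines works. The random inputs below are placeholders with the sample sizes from the conclusion, not the actual data behind the numbers above.)

    import timeit
    import numpy as np

    # Placeholder inputs (Nw = N_bd = N_kp = 20); real data will give different times.
    Nw, N_bd, N_kp = 20, 20, 20
    hw = np.linspace(0.0, 2.0, Nw)
    width = 0.05
    eigv = np.random.rand(N_kp, N_bd)

    runs = 50
    t = timeit.timeit(lambda: Delta_Gaussf(hw, width, eigv), number=runs) / runs
    print(f"mean runtime: {t * 1e3:.2f} ms")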

    But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading which functions we're calling and converting them into C types and C calls to remove the Python function-call overhead. It's doing more than that, but that gives the gist of where we're going to gain a benefit. Gaining this benefit is trivial in this case:

    import numba
    
    @numba.jit
    def Delta_Gaussf(hw, width, eigv):
        eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
        w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
        v = np.power(w_sub / width, 2)
        denom = np.sqrt(2.0 * np.pi) * width
        Delta_Gauss = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    The resulting function is down to about ~7ms on my sample data, down from ~10ms, just by adding that decorator. Pretty nice for no effort.


    EDIT: @max9111 gave a better answer which points out that numba works much better with loop syntax than with numpy broadcasting code. With almost no work besides removing the inner if statement, he shows that numba.jit can make the nearly unchanged original loop code even faster. The result is much cleaner, in that you still have just the single innermost equation showing what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.

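    Roughly, that loop-based approach looks like the sketch below. This is just my paraphrase of the idea, not @max9111's actual code: the original four nested loops with the inner if removed, compiled by numba, with the outer loop marked for parallel execution.

    import numpy as np
    import numba

    # Sketch of the loop-based numba approach (a paraphrase, not @max9111's exact code):
    # the original four nested loops, inner `if` removed, compiled in nopython mode.
    @numba.njit(parallel=True)
    def loop_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd))
        denom = np.sqrt(2.0 * np.pi) * width
        for w1 in numba.prange(Nw):  # prange lets numba split this loop across cores
            for k1 in range(N_kp):
                for i1 in range(N_bd):
                    for j1 in range(N_bd):
                        diff = eigv[k1, j1] - eigv[k1, i1] - hw[w1]
                        Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * (diff / width) ** 2) / denom
        return Delta_Gauss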

    Conclusion

    For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):

    Original code:               ~900 ms
    Fortran estimate:            ~90 ms (based on OP saying it was ~10x faster)
    Final numpy code:            ~10 ms
    Final code with numba.jit:   ~7 ms
    max9111's solution (serial): ~4ms
    max9111 (parallel 2-core):   ~3ms
    
    Overall vectorized speedup: ~130x
    max9111's numba speedup: ~300x (potentially more with more cores)
    

    I don't know how fast exactly your Fortran code is, but it looks like proper usage of numpy allows you to easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.
