A long-term puzzle: how to optimize multi-level loops in Python?

终归单人心 2020-12-19 21:19

I have written a function in Python to calculate the Delta function with Gaussian broadening, which involves 4-level loops. However, the efficiency is very low: it runs about 10 times slower than my equivalent Fortran code.

2 answers
  •  刺人心 (OP)
     2020-12-19 21:54

    BLUF: Using Numpy's full functionality, plus another neat module, you can make the Python version run more than 100x faster than this raw for-loop code. Using @max9111's answer, however, you can get even faster with much cleaner code and less work.

    The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and the final code make sense. Essentially, we're going to use a lot of broadcasting so that Numpy performs the looping under the hood (which is always faster than looping in Python). The code computes the full square matrix of results, which means we're doing some work that the original code's inner if skipped, but it's easier, and honestly probably faster, to do that extra work in high-performance ways than to have an if at the deepest level of looping to avoid it. Avoiding it might be worthwhile in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production; it's not necessary).
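
    For reference, np.triu just zeroes out everything below the main diagonal, and for arrays with more than two dimensions it operates on the last two axes, which is why it can be applied directly to the (W, N, B, B) result. A tiny illustration (my own toy example, not part of the original code):

    import numpy as np

    a = np.arange(9).reshape(3, 3)
    print(np.triu(a))
    # [[0 1 2]
    #  [0 4 5]
    #  [0 0 8]]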

    First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:

    def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
        # The denominator never changes across iterations, so compute it once.
        denom = np.sqrt(2.0 * np.pi) * width
        # np.matrix keeps each row eigv[k1] two-dimensional (shape (1, N_bd)),
        # so row minus row-transposed broadcasts to an (N_bd, N_bd) matrix.
        eigv = np.matrix(eigv)
        for w1 in range(Nw):
            for k1 in range(N_kp):
                this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
                v = np.power(this_eigv / width, 2)
                Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    
        # Take the upper triangle to make the result exactly equal to the original code
        return np.triu(Delta_Gauss)
    

    Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops can be removed in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently subtracting matrices of shapes (1, B) - (B, 1) for each of the N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):

    def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
        denom = np.sqrt(2.0 * np.pi) * width
        for w1 in range(Nw):
            this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1) and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):

    def Delta_Gaussf(hw, width, eigv):
        # Pairwise differences for every k-point at once: shape (N_kp, N_bd, N_bd)
        eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
        # Broadcast (1, N, B, B) - (W, 1, 1, 1) -> (W, N, B, B)
        w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
        v = np.power(w_sub / width, 2)
        denom = np.sqrt(2.0 * np.pi) * width
        Delta_Gauss = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
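
    (If you want to reproduce the comparison yourself, a minimal timing harness along these lines works. The random inputs below are placeholders with the sample sizes from the conclusion, not the actual data behind the numbers above.)

    import timeit
    import numpy as np

    # Placeholder inputs (Nw = N_bd = N_kp = 20); real data will give different times.
    Nw, N_bd, N_kp = 20, 20, 20
    hw = np.linspace(0.0, 2.0, Nw)
    width = 0.05
    eigv = np.random.rand(N_kp, N_bd)

    runs = 50
    t = timeit.timeit(lambda: Delta_Gaussf(hw, width, eigv), number=runs) / runs
    print(f"mean runtime: {t * 1e3:.2f} ms")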

    But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading which functions we're calling and converting them into C types and C calls to remove the Python function-call overhead. It's doing more than that, but that gives the gist of where we're going to gain a benefit. Gaining this benefit is trivial in this case:

    import numba
    
    @numba.jit
    def Delta_Gaussf(hw, width, eigv):
        eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
        w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
        v = np.power(w_sub / width, 2)
        denom = np.sqrt(2.0 * np.pi) * width
        Delta_Gauss = np.exp(-0.5 * v) / denom
        return np.triu(Delta_Gauss)
    

    The resulting function is down to about ~7ms on my sample data, down from ~10ms, just by adding that decorator. Pretty nice for no effort.


    EDIT: @max9111 gave a better answer which points out that numba works much better with loop syntax than with numpy broadcasting code. With almost no work besides removing the inner if statement, he shows that numba.jit can make the nearly unchanged original loop code even faster. The result is much cleaner, in that you still have just the single innermost equation showing what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.

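    Roughly, that loop-based approach looks like the sketch below. This is just my paraphrase of the idea, not @max9111's actual code: the original four nested loops with the inner if removed, compiled by numba, with the outer loop marked for parallel execution.

    import numpy as np
    import numba

    # Sketch of the loop-based numba approach (a paraphrase, not @max9111's exact code):
    # the original four nested loops, inner `if` removed, compiled in nopython mode.
    @numba.njit(parallel=True)
    def loop_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
        Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd))
        denom = np.sqrt(2.0 * np.pi) * width
        for w1 in numba.prange(Nw):  # prange lets numba split this loop across cores
            for k1 in range(N_kp):
                for i1 in range(N_bd):
                    for j1 in range(N_bd):
                        diff = eigv[k1, j1] - eigv[k1, i1] - hw[w1]
                        Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * (diff / width) ** 2) / denom
        return Delta_Gauss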

    Conclusion

    For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):

    Original code:               ~900 ms
    Fortran estimate:            ~90 ms (based on OP saying it was ~10x faster)
    Final numpy code:            ~10 ms
    Final code with numba.jit:   ~7 ms
    max9111's solution (serial): ~4ms
    max9111 (parallel 2-core):   ~3ms
    
    Overall vectorized speedup: ~130x
    max9111's numba speedup: ~300x (potentially more with more cores)
    

    I don't know how fast exactly your Fortran code is, but it looks like proper usage of numpy allows you to easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.
