Is there a way to speed up Numpy array calculations when they only contain values in upper/lower triangle?

Submitted by 浪尽此生 on 2020-02-04 11:27:04

Question


I'm doing some matrix calculations (2d) that only involve values in the upper triangle of the matrices.

So far I've found that using Numpy's triu method ("return a copy of a matrix with the elements below the k-th diagonal zeroed") works and is quite fast. But presumably, the calculations are still being carried out for the whole matrix, including unnecessary calculations on the zeros. Or are they?...
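For reference, a minimal toy sketch (my own example, not from the question) of what np.triu does:

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)
# np.triu zeroes every element below the k-th diagonal (k=0 keeps the main diagonal)
upper = np.triu(a)
print(upper)
# [[1 2 3]
#  [0 5 6]
#  [0 0 9]]
```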

Here is an example of what I tried first:

# Initialize vars
N = 160
u = np.empty(N) 
u[0] = 1000
u[1:] = np.cumprod(np.full(N-1, 1/2**(1/16)))*1000
m = np.random.random(N)

def method1():
    # Prepare matrices with values only in upper triangle
    ones_ut = np.triu(np.ones((N, N)))
    u_ut = np.triu(np.broadcast_to(u, (N, N)))
    m_ut = np.triu(np.broadcast_to(m, (N, N)))

    # Do calculation
    return (ones_ut - np.divide(u_ut, u.reshape(N, 1)))**3*m_ut

Then I realized I only need to zero-out the final result matrix:

def method2():
    return np.triu((np.ones((N, N)) - np.divide(u, u.reshape(N, 1)))**3*m)
assert np.array_equal(method1(), method2())

But to my surprise, this was slower.

In [62]: %timeit method1()                                                                                
662 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [63]: %timeit method2()                                                                                
836 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Does numpy do some kind of special optimization when it knows the matrices contain half zeros?

I'm curious why it is slower, but my main question is: is there a way to speed up vectorized calculations by taking advantage of the fact that you are not interested in half the values in the matrix?

UPDATE

I tried just doing the calculations over 3 of the quadrants of the matrices but it didn't achieve any speed increase over method 1:

def method4(): 
    split = N//2 
    x = np.zeros((N, N)) 
    u_mat = 1 - u/u.reshape(N, 1) 
    x[:split, :] = u_mat[:split,:]**3*m 
    x[split:, split:] = u_mat[split:, split:]**3*m[split:] 
    return np.triu(x) 
assert np.array_equal(method1(), method4())                                                  
In [86]: %timeit method4()                                                                                
683 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But this is faster than method 2.


Answer 1:


Here is another solution that is faster in some cases but slower in others.

idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result

Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment operation result[idx] = .... So the method is faster when the overhead is less than the saved computations -- which happens when N is small or the computation is relatively complex (e.g., using t ** 3 instead of t * t * t).
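As a sanity check, the index-based computation matches the full-matrix formula from the question. A small self-contained sketch, with toy stand-ins for u and m:

```python
import numpy as np

# Toy stand-ins for the question's u and m (an assumption for illustration)
N = 8
u = np.linspace(1.0, 2.0, N)
m = np.random.random(N)

idx = np.triu_indices(N)
result = np.zeros((N, N))
t = 1 - u[idx[1]] / u[idx[0]]
result[idx] = t * t * t * m[idx[1]]

# Full-matrix reference from the question, zeroed afterwards with triu
reference = np.triu((1 - u / u[:, None])**3 * m)
assert np.allclose(result, reference)
```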

Another variation of the method is to use 1D-index for the assignment operation, which can lead to a small speedup.

idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))
def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
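The 1D-index variant writes to exactly the same cells as the 2D fancy index; a quick sketch of the equivalence:

```python
import numpy as np

N = 4
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

a = np.zeros((N, N))
a[idx] = 1.0                  # 2D fancy-index assignment
b = np.zeros((N, N))
b.ravel()[raveled_idx] = 1.0  # flat-index assignment through a view
assert np.array_equal(a, b)
```

Note that ravel() returns a view here (the array is C-contiguous), so the flat assignment modifies b in place.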

Following are the results of performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged). Hence their values can be precomputed and excluded from the timings. (If you need to call these methods with matrices of many different sizes, there will be added overhead in computing idx and raveled_idx.) For comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.

%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()

Result (for N = 160):

1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For N = 32:

89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For N = 1000:

70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):

1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here, my_method and my_method2 outperform method4b and method5 a little bit.




Answer 2:


We can simplify things to leverage broadcasting in as few places as possible. With that, we end up with an expression that produces the final output directly from u and m, like so:

np.triu((1-u/u.reshape(N, 1))**3*m)

Then we can leverage the numexpr module, which performs noticeably better with transcendental operations (as is the case here) and is also very memory-efficient. Ported to numexpr, it becomes:

import numexpr as ne

np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))

Bringing the masking inside the evaluate call gives a further performance boost:

M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
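One subtlety: np.tri(N, dtype=bool) is True on and below the main diagonal, so the (1-M) mask also zeroes the diagonal, while np.triu keeps it. That is harmless here only because the diagonal of (1-u/u2D) is already zero. A pure-NumPy sketch of the equivalence, with toy u and m (plain NumPy rejects 1 - M on a boolean array, so the sketch uses ~M; numexpr accepts the 1-M spelling):

```python
import numpy as np

N = 5
u = np.linspace(1.0, 2.0, N)  # toy stand-in for the question's u
m = np.random.random(N)

M = np.tri(N, dtype=bool)     # True on and below the main diagonal
masked = (~M) * (1 - u / u[:, None])**3 * m
reference = np.triu((1 - u / u[:, None])**3 * m)
assert np.allclose(masked, reference)  # the diagonal terms are zero anyway
```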

Timings on the given dataset:

In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop

In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop

In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop

In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop

In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop

Note that another way to extend u to a 2D version would be with np.newaxis/None and this would be the idiomatic way. Hence, u.reshape(N, 1) could be replaced by u[:,None]. This shouldn't change the timings though.
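A quick check that the two spellings are interchangeable:

```python
import numpy as np

u = np.arange(4.0)
# Both produce a (4, 1) column view suitable for broadcasting
assert u.reshape(4, 1).shape == (4, 1)
assert np.array_equal(u.reshape(4, 1), u[:, None])
```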




Answer 3:


I think the answer may be quite simple. Just put zeros in the cells that you don't want to calculate and the overall calculation will be faster. I think that might explain why method1() was faster than method2().

Here are some tests to illustrate the point.

In [29]: size = (160, 160)                                                      

In [30]: z = np.zeros(size)                                                     

In [31]: r = np.random.random(size) + 1                                         

In [32]: t = np.triu(r)                                                         

In [33]: w = np.ones(size)                                                      

In [34]: %timeit z**3                                                           
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [35]: %timeit t**3                                                           
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [36]: %timeit r**3                                                           
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [37]: %timeit w**3                                                           
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [38]: %timeit np.triu(r)**3                                                  
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %timeit np.triu(r**3)                                                  
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm not sure how all this works at a low level, but clearly zero or one raised to a power takes much less time to compute than other values.

Also interesting: with numexpr there is no difference.

In [42]: %timeit ne.evaluate("r**3")                                            
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [43]: %timeit ne.evaluate("z**3")                                            
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So I think the fastest approach without numexpr may be this:

def method5(): 
    return np.triu(1 - u/u[:, None])**3*m  
assert np.array_equal(method1(), method5())                            
In [65]: %timeit method1()                                                      
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit method5()                                                      
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Or, if you are really chasing every micro-second:

def method4b():  
    split = N//2  
    x = np.zeros((N, N))  
    u_mat = np.triu(1 - u/u.reshape(N, 1))  
    x[:split, :] = u_mat[:split,:]**3*m  
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]  
    return x 
assert np.array_equal(method1(), method4b())                           
In [71]: %timeit method4b()                                                     
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [72]: %timeit method4b()                                                     
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

And @Divakar's answer using numexpr is the fastest overall.

UPDATE

Thanks to @GZ0's comment, if you only need to raise to the power of 3, this is much faster:

def method6(): 
    a = np.triu(1 - u/u[:, None]) 
    return a*a*a*m 
assert np.isclose(method1(), method6()).all()                          

(Though I noticed a slight loss of precision.)

In [84]: %timeit method6()                                                      
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).
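The slight precision difference is expected: t ** 3 and t * t * t can round differently in the last bits, which is why the equality check uses np.isclose rather than exact comparison. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.random(100)
# pow and repeated multiplication may differ by a few ULPs but agree to tolerance
assert np.allclose(t ** 3, t * t * t)
```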



Source: https://stackoverflow.com/questions/57099692/is-there-a-way-to-speed-up-numpy-array-calculations-when-they-only-contain-value
