In a program I am working on, I need to multiply two matrices repeatedly. Because of the size of one of the matrices, this operation takes some time, and I wanted to see which approach would be fastest.
You can still optimize these loops by improving the memory access pattern. Your function could look like this (assuming the matrices are 1000x1000):
import numpy as np

CS = 10        # block (chunk) size
NCHUNKS = 100  # number of blocks per dimension; CS * NCHUNKS = 1000

def dot_chunked(A, B):
    C = np.zeros((1000, 1000))
    for i in range(NCHUNKS):
        for j in range(NCHUNKS):
            for k in range(NCHUNKS):
                # multiply block (i, k) of A with block (k, j) of B
                for ii in range(i * CS, (i + 1) * CS):
                    for jj in range(j * CS, (j + 1) * CS):
                        for kk in range(k * CS, (k + 1) * CS):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C
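To sanity-check the blocked version against NumPy's built-in product, here is a small parameterized variant (the matrix size and block size below are just illustration values I picked, not from the answer above):

```python
import numpy as np

def dot_chunked_n(A, B, cs):
    """Blocked matmul for square n x n matrices, assuming cs divides n."""
    n = A.shape[0]
    nchunks = n // cs
    C = np.zeros((n, n))
    for i in range(nchunks):
        for j in range(nchunks):
            for k in range(nchunks):
                for ii in range(i * cs, (i + 1) * cs):
                    for jj in range(j * cs, (j + 1) * cs):
                        for kk in range(k * cs, (k + 1) * cs):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C

rng = np.random.default_rng(0)
A = rng.random((40, 40))
B = rng.random((40, 40))
# blocked result should match NumPy's matmul to floating-point tolerance
assert np.allclose(dot_chunked_n(A, B, cs=10), A @ B)
```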
Explanation: the loops i and ii together cover the same range that a single loop over rows did before, and the same holds for j/jj and k/kk. The difference is that CSxCS regions of A and B can now stay in the cache (I believe) and be reused more than once.
You can play around with CS and NCHUNKS. For me, CS=10 and NCHUNKS=100 worked well. Adding numba.jit on top accelerates the code from 7 s to 850 ms (note that I use 1000x1000 matrices here, while the graphs above were run with 3x3x10^5, so it's a somewhat different scenario).
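The numba speedup mentioned above can be reproduced roughly as sketched below; timings will vary by machine, and the fallback decorator is my addition so the code still runs as plain Python when numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # assumed available; otherwise fall back to plain Python
except ImportError:
    def njit(func):
        return func

CS = 10  # block size; assumed to divide the matrix dimension evenly

@njit
def dot_chunked_jit(A, B):
    n = A.shape[0]
    nchunks = n // CS
    C = np.zeros((n, n))
    for i in range(nchunks):
        for j in range(nchunks):
            for k in range(nchunks):
                for ii in range(i * CS, (i + 1) * CS):
                    for jj in range(j * CS, (j + 1) * CS):
                        for kk in range(k * CS, (k + 1) * CS):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C

# small matrices of ones: every entry of the product equals n
A = np.ones((100, 100))
B = np.ones((100, 100))
C = dot_chunked_jit(A, B)
assert np.allclose(C, A @ B)
```

The first call pays numba's compilation cost; subsequent calls run the compiled loop nest, which is where the reported speedup comes from.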