Basically I have a problem that is pretty much embrassing parallel and I think I\'ve hit the limits of how fast I can make it with plain python & multiprocessing so I\'m
This question is from 3 years ago and nowadays Cython has available functions that support the OpenMP backend. See for example the documentation here. One very convenient function is the prange. This is one example of how a (rather naive) dot function could be implemented using prange.
Don't forget to compile passing the "/opemmp" argument to the C compiler.
import numpy as np
cimport numpy as np
import cython
from cython.parallel import prange
ctypedef np.double_t cDOUBLE
DOUBLE = np.float64
def mydot(np.ndarray[cDOUBLE, ndim=2] a, np.ndarray[cDOUBLE, ndim=2] b):
cdef np.ndarray[cDOUBLE, ndim=2] c
cdef int i, M, N, K
c = np.zeros((a.shape[0], b.shape[1]), dtype=DOUBLE)
M = a.shape[0]
N = a.shape[1]
K = b.shape[1]
for i in prange(M, nogil=True):
multiply(&a[i,0], &b[0,0], &c[i,0], N, K)
return c
@cython.wraparound(False)
@cython.boundscheck(False)
@cython.nonecheck(False)
cdef void multiply(double *a, double *b, double *c, int N, int K) nogil:
cdef int j, k
for j in range(N):
for k in range(K):
c[k] += a[j]*b[k+j*K]