python - way to do fast matrix multiplication and reduction while working in memmaps and CPU

好久不见 · Submitted 2019-12-02 05:39:10

Question


Hi, I have a problem doing fast matrix multiplication, addition, in-place function application, and summation with axis reduction while working on numpy.memmaps over the CPU without (I think) touching RAM. Only with numexpr can I avoid creating an intermediate array from the dot product.

For example:

a = np.require(np.memmap('a.npy', mode='w+', order='C', dtype=np.float64, shape=(10, 1)), requirements=['O'])
b = np.memmap('b.npy', mode='w+', order='C', dtype=np.float64, shape=(1, 5))
c = np.memmap('c.npy', mode='w+', order='C', dtype=np.float64, shape=(1, 5))
# func -> some ufunc, e.g. np.sin
# in numexpr it is a single fused expression:
ne.evaluate('sum(func(b*a+c),axis=1)')
# in numpy with einsum it needs an extra (10, 5) scratch array for the
# intermediate product, plus a (10,) array for the reduced result:
d = np.require(np.memmap('d.npy', mode='w+', order='C', dtype=np.float64, shape=(10, 5)), requirements=['O'])
e = np.memmap('e.npy', mode='w+', order='C', dtype=np.float64, shape=(10,))

np.einsum('ij,jk->ik', a, b, out=d)   # outer product b*a, written into d
d += c
func(d, out=d)
np.einsum('ij->i', d, out=e)          # row sums, shape (10,)

Is it even possible to do this faster on the CPU, without using RAM, than numexpr does? What about Cython plus Fortran LAPACK or BLAS? Any tips or tricks are welcome! Thanks for any help!
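One way to keep RAM bounded without numexpr is to run the same pipeline block-by-block, reusing one small scratch buffer per chunk of rows. This is a minimal sketch (plain NumPy, in-RAM arrays here for brevity; `blockwise_sum` and the block size are my own illustrative names, and with real memmaps only the current slice would be paged in):

```python
import numpy as np

# Block-wise version of: result = func(b*a + c).sum(axis=1)
# Only a (block, ncols) scratch buffer lives in RAM at any moment,
# regardless of how many rows `a` has.
def blockwise_sum(a, b, c, block=4, func=np.sin):
    rows = a.shape[0]
    result = np.empty(rows, dtype=np.float64)
    scratch = np.empty((block, b.shape[1]), dtype=np.float64)  # reused buffer
    for start in range(0, rows, block):
        stop = min(start + block, rows)
        buf = scratch[: stop - start]
        np.multiply(a[start:stop], b, out=buf)  # b*a for this block of rows
        buf += c
        func(buf, out=buf)                      # in-place, no extra array
        result[start:stop] = buf.sum(axis=1)
    return result

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 1))
b = rng.standard_normal((1, 5))
c = rng.standard_normal((1, 5))
print(np.allclose(blockwise_sum(a, b, c), np.sin(b * a + c).sum(axis=1)))  # True
```

The `out=` arguments are what keep the loop allocation-free; the same pattern works when `a` and `result` are memmaps opened with `mode='r'`/`'w+'`.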

EDITED INFO: By the way, I'm working on a laptop with an Intel Core 2 Duo T9300 CPU, 2.7 GB of RAM (only that much of the 4 GB is visible, due to some BIOS problem), a 250 GB SSD, and an old Intel GPU. With the little RAM available mostly taken by Firefox and some add-ons, there is not much left for coding, which is why I'm avoiding RAM usage xD.

And I feel like I'm at an advanced level (step 1/1000) of programming, since for now I don't know how code works on the hardware - I'm only guessing (so some mistakes in my thinking may appear xD).

EDIT: I wrote some Cython code for calculating sine waves, once with numexpr and once with a Cython prange for-loop.

Pulsation data (om, eps, Spectra, Amplitude) is stored in the OM numpy.memmap and time data (t, z) in the TI numpy.memmap. OM has shape (4, 1, 2500) and TI has shape (2, 1, 5e+5, 1) - I just need them in those shapes.
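For reference, the superposition being computed is elevation(t_i) = sum_j A_j * sin(om_j * t_i + eps_j). A pure-NumPy, in-RAM version (small sizes here so it fits in memory; the array layout follows the shapes described above) can serve as a correctness check for the Cython and numexpr versions:

```python
import numpy as np

# OM rows: 0 = om (frequencies), 1 = Spectra, 2 = eps (phases), 3 = A (amplitudes)
# TI rows: 0 = t (time samples), 1 = computed elevation
n_freq, n_time = 8, 16
rng = np.random.default_rng(1)
OM = rng.standard_normal((4, 1, n_freq))
TI = np.zeros((2, 1, n_time, 1))
TI[0, 0, :, 0] = np.linspace(0.0, 1.0, n_time)

t = TI[0, 0, :, 0]                          # (n_time,)
om, eps, A = OM[0, 0], OM[2, 0], OM[3, 0]   # each (n_freq,)
# outer(t, om) is (n_time, n_freq); summing over axis 1 collapses frequencies
TI[1, 0, :, 0] = (A * np.sin(np.outer(t, om) + eps)).sum(axis=1)
```

This allocates the full (n_time, n_freq) intermediate, which is exactly what the chunked/looped versions below avoid.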

cdef inline void sine_wave_numexpr(OM, TI, int num_of_threads):
    # evaluate the superposition in n chunks along the time axis
    cdef long m, n = 10
    cdef Py_ssize_t s = TI.shape[2] // n
    cdef str ex_sine_wave = r'sum(A*sin(om*ti+eps),axis=1)'
    cdef dict dct = {'A': OM[3], 'om': OM[0], 'eps': OM[2]}
    for m in range(n):
        sl = slice(s * m, s * (m + 1))
        dct['ti'] = TI[0, 0, sl]
        evaluate(ex_sine_wave,
                 global_dict=dct,
                 out=TI[1, 0, sl, 0])
cdef inline void sine_wave_cython(double[:, :, ::1] OM,
                                  double[:, :, :, ::1] TI,
                                  int num_of_threads):
    cdef Py_ssize_t i, j, n, m
    cdef double t, A, om, eps
    n = OM.shape[2]
    m = TI.shape[2]
    # parallelise over the independent time samples only; the inner frequency
    # loop accumulates into a single element, so it must stay a plain range
    for i in prange(m, nogil=True, num_threads=num_of_threads):
        t = TI[0, 0, i, 0]
        for j in range(n):
            A = OM[3, 0, j]
            om = OM[0, 0, j]
            eps = OM[2, 0, j]
            TI[1, 0, i, 0] += A * sin(om * t + eps)

cpdef inline void wave_elevation(double dom, OM, TI, int num_of_threads,
                                 str method='cython'):
    cdef Py_ssize_t shape = OM.shape[2]
    numexpr_threads(num_of_threads)
    OM[2, 0] = 2. * np.random.standard_normal(shape)   # random phases
    evaluate('sqrt(dom*2*S)', out=OM[3],
             local_dict={'dom': dom, 'S': OM[1]})      # amplitudes from spectrum
    if method == 'cython':
        sine_wave_cython(OM, TI, num_of_threads)
    elif method == 'numexpr':
        sine_wave_numexpr(OM, TI, num_of_threads)
    TI.shape = TI.shape[:3]

I'm just starting with Cython, so this may not be well optimised. For now, the prange version takes the same time as the numexpr one (RAM usage is 100 MB for the whole program with this part included, CPU at 50%, SSD activity low - the calculation takes 1-2 minutes). I tried memoryviews, but that created some local copies and used RAM, even though the runtime went down. I will need to be at advanced level step 3/1000 to understand how to work with memoryviews.
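On the copy worry: plain slicing of a memmap returns a view into the file-backed buffer, so writes through the slice land in the parent array without duplicating it in RAM; copies usually sneak in through fancy indexing or dtype conversion instead. A quick way to check (file name and sizes here are just for illustration) is to write through a slice and read back via the parent:

```python
import os
import tempfile
import numpy as np

# Create a small throwaway memmap and verify that a slice is a view,
# not a RAM copy: a write through the slice is visible in the parent.
path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
m = np.memmap(path, mode='w+', order='C', dtype=np.float64, shape=(1000, 4))
view = m[10:20]           # basic slicing -> shares the file-backed buffer
view[:] = 1.0
print(m[10, 0], m[9, 0])  # 1.0 0.0
```

The same check works on a typed memoryview taken from a memmap: as long as only basic slicing is used, no local copy should be created.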

Source: https://stackoverflow.com/questions/32551007/python-way-to-do-fast-matrix-multiplication-and-reduction-while-working-in-mem
