cython shared memory in cython.parallel.prange - block

后端未结

关注

 2  680

I have a function foo that takes a pointer to memory as argument and both writes and reads to that memory:

cdef void foo (double *data):
   data


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  梦如初夏        
                
              
                            
                2020-12-01 19:19
              
            
            
                                                                       
I assume that without read or write synchronization locks to data threads will read/write to the memory location and overwrite each other's changes. You will not get consistent results without some kind of synchronization.

Although the docs (http://docs.cython.org/src/userguide/parallelism.html) seem to suggest that OpenMP (the default backend) automatically creates thread locals.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  借酒劲吻你        
                
              
                            
                2020-12-01 19:31
              
            
            
                                                                       
A good way is to have the main array beeing alocated ouside the threads. Then you give to each thread the pointer to the part of the main array that should be computed by the thread.

The following example is an implementation of a matrix multiplication (similar to dot for 2-D arrays) where:

c = a*b


The parallelism here is implemented over the rows of a. Check how the pointers are passed to the multiply function in order allow the different threads to share the same arrays.

import numpy as np
cimport numpy as np
import cython
from cython.parallel import prange

ctypedef np.double_t cDOUBLE
DOUBLE = np.float64


def mydot(np.ndarray[cDOUBLE, ndim=2] a, np.ndarray[cDOUBLE, ndim=2] b):
    cdef np.ndarray[cDOUBLE, ndim=2] c
    cdef int i, M, N, K

    c = np.zeros((a.shape[0], b.shape[1]), dtype=DOUBLE)
    M = a.shape[0]
    N = a.shape[1]
    K = b.shape[1]

    for i in prange(M, nogil=True):
        multiply(&a[i,0], &b[0,0], &c[i,0], N, K)

    return c


@cython.wraparound(False)
@cython.boundscheck(False)
@cython.nonecheck(False)
cdef void multiply(double *a, double *b, double *c, int N, int K) nogil:
    cdef int j, k
    for j in range(N):
        for k in range(K):
            c[k] += a[j]*b[k+j*K]


To check you can use this script:

import time

import numpy as np

import _stack

a = np.random.random((10000,500))
b = np.random.random((500,2000))

t = time.clock()
c = np.dot(a, b)
print('finished dot: {} s'.format(time.clock()-t))

t = time.clock()
c2 = _stack.mydot(a, b)
print('finished mydot: {} s'.format(time.clock()-t))

print 'Passed test:', np.allclose(c, c2)


Where on my computer it gives:

finished dot: 0.601547366526 s
finished mydot: 2.834147917 s
Passed test: True


If the number of rows of a was smaller than then number of cols or the number of cols in b the mydot would be worse, requiring a better check on which dimension to make the parallelism.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复