OpenMP multiple threads update same array

后端未结

关注

 2  1771

I have the following code in my program and I want to accelerate it using OpenMP.

...
for(i=curr_index; i < curr_index + rx_size; i+=2){ 
    int64_t tgt


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  甜味超标        
                
              
                            
                2020-12-02 03:02
              
            
            
                                                                       
Yes, the critical section is limiting your performance. You should collect the results locally for each thread and then merge them.

size_t newq_offset = 0;
#pragma omp parallel
{
    // Figure out something clever here...
    const size_t max_newq_per_thread = max_newq / omp_get_num_threads();
    int64_t* local_newq = malloc(max_results_per_thread * sizeof(int64_t));
    size_t local_newq_count = 0;

    #pragma omp parallel for
    for (i=curr_index; i < curr_index + rx_size; i+=2)
        int64_t tgt = rcvq[2*index];
        int64_t src = rcvq[2*index+1];
        if (!TEST(tgt)) {
            pred[tgt] = src;
            local_newq_count++;
            assert(local_newq_count < max_newq_per_thread);
            local_newq[local_newq_count] = tgt;
        }
    }
    int local_offset;
    #pragma omp atomic capture
    {
        local_offset = offset;
        offset += local_newq_count;
    }
    for (size_t i = 0; i < counter; i++)
    {   
        res_global[i + local_offset] = res[i];
    }
}


With this approach, all threads work in parallel on the merging and there is only minimal contention on the atomic capture. Note that you can also make a simple version with atomic capture, that is more efficient than the critical section, but will still become a bottleneck quickly:

size_t newq_count_local;
#pragma omp atomic capture
newq_count_local = newq_count++;
newq[newq_count_local] = tgt;



There is no guarantee about order within newq in any of the versions
Always declare variables as local as possible! Especially when using OpenMP. The critical-version you posted is wrong, because index (defined in an outer scope) is implictly shared among threads.
All of this assumes that there are no duplicates within rcvq. Otherwise, you get a race condition on pred[tgt] = src;.
Your approach of slicing up the loop manually is unnecessarily complicated. No need to do two loops, just use one pragma omp for loop.


The other answer gets the idea right. However, it is C++, not, as tagged, C. There is also a subtle yet significant performance issue with using std::vector<std::vector<>>. Usually a vector is implemented with three pointers, a total of 24 byte. Upon push_back one of the pointers is incremented. This means that, a) the pointers of local vectors from multiple threads reside on the same cache line, and b) on every successful TEST, push_back reads and writes to a cache line that is used by other thread(s). This cache-line will have to move around between cores all the time, greatly limiting the scalability of this approach. This is called false sharing.

I implemented a small test based on the other answer giving the following performance:


0.99 s - single thread
1.58 s - two threads on two neighbouring cores of the same socket
2.13 s - two threads on two cores of different sockets
0.99 s - two threads sharing a single core
0.62 s - 24 threads on two sockets


Whereas above C version scales much better:


0.46 s - single thread (not really comparable C vs C++)
0.24 s - two threads
0.04 s - 24 threads

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘了有多久        
                
              
                            
                2020-12-02 03:09
              
            
            
                                                                       
I'm pretty sure omp critical section limiting your performance at this point. 

I'd recommend you to collect the results into separate buffers/vectors and merge them after the parallel processing is done (of course, if the order doesn't matter for you)

vector<vector<int64_t>> res;
res.resize(num_threads);
#pragma omp parallel for
for (index = 0; index < rx_sz/2; ++index) { 
        int64_t tgt = rcvq[2*index];
        int64_t src = rcvq[2*index+1];
        if (!TEST(tgt)) {
            pred[tgt] = src;

            res[omp_get_thread_num()].push_back(tgt);
        }
    }

// Merge all res vectors if needed

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复