Array size and copy performance


Question


I'm sure this has been answered before, but I can't find a good explanation.

I'm writing a graphics program where a part of the pipeline is copying voxel data to OpenCL page-locked (pinned) memory. I found that this copy procedure is a bottleneck, so I measured the performance of a simple std::copy. The data is floats, and each chunk of data that I want to copy is around 64 MB in size.

This is my original code, before any attempts at benchmarking:

std::copy(data, data+numVoxels, pinnedPointer_[_index]);

Here, data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referencing a pinned OpenCL buffer.
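For reference, the pinned pointer comes from a setup along these lines (a minimal sketch; context and queue stand in for the program's actual OpenCL state, and error handling is omitted):

cl_int err;
// Ask the OpenCL runtime to allocate host-accessible (pinned) memory.
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            numVoxels * sizeof(float), NULL, &err);
// Mapping the buffer yields a host pointer into the page-locked region.
float* pinned = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, numVoxels * sizeof(float),
                                           0, NULL, NULL, &err);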

Since that performed poorly, I decided to try copying smaller parts of the data instead and see what kind of bandwidth I got. I used boost::timer::cpu_timer for timing. I've tried both running it for some time and averaging over a couple of hundred runs, with similar results. Here is the relevant code along with the results:

boost::timer::cpu_timer t;
unsigned int testNum = numVoxels;
while (testNum > 2) {
  t.start();
  std::copy(data, data+testNum, pinnedPointer_[_index]);
  t.stop();
  boost::timer::cpu_times result = t.elapsed();
  double time = (double)result.wall / 1.0e9;   // wall time is in nanoseconds
  int size = testNum*sizeof(float);
  double GB = (double)size / 1073741824.0;     // bytes -> GiB (2^30)
  // Print results
  testNum /= 2;
}

Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
Copied 524288 bytes in 0.000063s, 7.776913 GB/s
Copied 262144 bytes in 0.000016s, 15.741867 GB/s
Copied 131072 bytes in 0.000008s, 15.213149 GB/s
Copied 65536 bytes in 0.000004s, 14.374742 GB/s
Copied 32768 bytes in 0.000003s, 10.209962 GB/s
Copied 16384 bytes in 0.000001s, 10.344942 GB/s
Copied 8192 bytes in 0.000001s, 6.476566 GB/s
Copied 4096 bytes in 0.000001s, 4.999603 GB/s
Copied 2048 bytes in 0.000001s, 1.592111 GB/s
Copied 1024 bytes in 0.000001s, 1.600125 GB/s
Copied 512 bytes in 0.000001s, 0.843960 GB/s
Copied 256 bytes in 0.000001s, 0.210990 GB/s
Copied 128 bytes in 0.000001s, 0.098439 GB/s
Copied 64 bytes in 0.000001s, 0.049795 GB/s
Copied 32 bytes in 0.000001s, 0.049837 GB/s
Copied 16 bytes in 0.000001s, 0.023728 GB/s

There is a clear bandwidth peak when copying chunks of 65536-262144 bytes, where the bandwidth is much higher than for the full array (roughly 15 vs 2 GB/s).

Knowing this, I tried another approach: copying the full array using repeated calls to std::copy, where each call handled just part of the array. Trying different chunk sizes, these are my results:

unsigned int testNum = numVoxels;                                             
unsigned int parts = 1;                                                       
while (sizeof(float)*testNum > 256) {                                         
  t.start();                                                                  
  for (unsigned int i=0; i<parts; ++i) {                                      
    std::copy(data+i*testNum, 
              data+(i+1)*testNum, 
              pinnedPointer_[_index]+i*testNum);
  }                                                                           
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9;                                  
  int size = testNum*sizeof(float);                                           
  double GB = parts*(double)size / 1073741824.0;                              
  // Print results
  parts *= 2;                                                                 
  testNum /= 2;                                                               
}      

Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s

Decreasing the chunk size clearly has a significant effect, but I still can't get anywhere near 15 GB/s.

I'm running 64-bit Ubuntu; GCC optimization levels don't make much of a difference.

  1. Why does the array size affect the bandwidth in this way?
  2. Does the OpenCL pinned memory play a part?
  3. What are the strategies for optimizing a large array copy?

Answer 1:


I'm pretty sure you are running into cache thrashing. Once you have filled the cache with data you've written, the next time some data is needed the cache has to fetch it from memory, but FIRST it needs to find space for it. Because all of the cached data [or at least a lot of it] is "dirty" from having been written to, it must be written back to RAM before its cache line can be reused. Then writing the next bit of data to the cache evicts yet another dirty line (or something that was read in earlier).

In assembler, we can overcome this by using a "non-temporal" move instruction, for example the SSE instruction movntps, which avoids storing the data in the cache.
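For illustration, here is a minimal sketch of such a copy using SSE intrinsics (the function name copy_stream is mine; it assumes the destination is 16-byte aligned and the count is a multiple of 4, so a real implementation would have to handle the unaligned head and tail):

   #include <cstddef>
   #include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_stream_ps, _mm_sfence

   // Copy 'count' floats from src to dst using non-temporal stores.
   void copy_stream(const float* src, float* dst, size_t count)
   {
      for (size_t i = 0; i < count; i += 4) {
         __m128 v = _mm_loadu_ps(src + i);  // ordinary load of 4 floats
         _mm_stream_ps(dst + i, v);         // non-temporal store, bypasses the cache
      }
      _mm_sfence();  // make the streaming stores visible before the data is used
   }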

Edit: You can also get better performance by not mixing reads and writes: use a small buffer [a fixed-size array] of, say, 4-16 KB, copy data into that buffer, then write the buffer out to where you want it. Again, ideally use non-temporal writes, as that improves throughput even in this case; but just reading a block and then writing a block, rather than alternating single reads and writes, will already go much faster.

Something like this:

   float temp[2048];          // small bounce buffer that stays in cache
   int left_to_do = numVoxels;
   int offset = 0;

   while (left_to_do > 0)
   {
      int block = std::min(left_to_do, (int)(sizeof(temp)/sizeof(temp[0])));
      std::copy(data+offset, data+offset+block, temp);              // read phase
      std::copy(temp, temp+block, pinnedPointer_[_index]+offset);   // write phase
      offset += block;
      left_to_do -= block;
   }

Try that, and see if it improves things. It may not...

Edit2: I should explain why this is faster: you are re-using the same small piece of cache to load data into every time, and by not mixing the reads and the writes, we get better performance from the memory itself.
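Putting the two ideas together, a sketch of the blocked copy with a non-temporal write-out phase might look like this (same assumptions as before: 16-byte-aligned destination, count a multiple of 4):

   #include <algorithm>
   #include <cstddef>
   #include <xmmintrin.h>

   void copy_blocked_stream(const float* src, float* dst, size_t count)
   {
      float temp[4096];  // 16 KB bounce buffer, small enough to stay resident in cache
      const size_t block_max = sizeof(temp) / sizeof(temp[0]);
      size_t done = 0;
      while (done < count) {
         size_t block = std::min(count - done, block_max);
         std::copy(src + done, src + done + block, temp);   // read phase: fill the buffer
         for (size_t i = 0; i < block; i += 4)              // write phase: stream it out
            _mm_stream_ps(dst + done + i, _mm_loadu_ps(temp + i));
         done += block;
      }
      _mm_sfence();
   }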



Source: https://stackoverflow.com/questions/16658139/array-size-and-copy-performance
