Prefetching Examples?

南旧 2020-11-27 10:44

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage?

5 Answers
  •  悲哀的现实
    2020-11-27 11:33

    Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.

    In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).

     #include <stdio.h>
     #include <stdlib.h>
     #include <time.h>

     int binarySearch(int *array, int number_of_elements, int key) {
         int low = 0, high = number_of_elements - 1, mid;
         while (low <= high) {
             mid = (low + high) / 2;
     #ifdef DO_PREFETCH
             /* Next midpoint if we take the low = mid + 1 branch. */
             __builtin_prefetch(&array[(mid + 1 + high) / 2], 0, 1);
             /* Next midpoint if we take the high = mid - 1 branch.
                Arguments: 0 = prefetch for a read, 1 = low temporal locality. */
             __builtin_prefetch(&array[(low + mid - 1) / 2], 0, 1);
     #endif

             if (array[mid] < key)
                 low = mid + 1;
             else if (array[mid] == key)
                 return mid;
             else
                 high = mid - 1;
         }
         return -1;
     }

     int main() {
         int SIZE = 1024 * 1024 * 512;
         int *array = malloc(SIZE * sizeof(int));
         for (int i = 0; i < SIZE; i++) {
             array[i] = i;
         }

         /* Look up many random keys so the timing and cache counters
            reflect the steady-state behaviour of the search. */
         int NUM_LOOKUPS = 1024 * 1024 * 8;
         srand(time(NULL));
         int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
         for (int i = 0; i < NUM_LOOKUPS; i++) {
             lookups[i] = rand() % SIZE;
         }

         for (int i = 0; i < NUM_LOOKUPS; i++) {
             int result = binarySearch(array, SIZE, lookups[i]);
         }

         free(array);
         free(lookups);
         return 0;
     }
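
    As an aside on the two magic numbers: per the GCC documentation, the builtin has the form __builtin_prefetch(addr, rw, locality), where rw is 0 for a read (the default) or 1 for a write, and locality runs from 0 (no temporal locality) to 3 (keep in all cache levels, the default). The calls above request read prefetches with low temporal locality. Since the question also mentions prefetcht0, here is a minimal sketch of issuing the same kind of hint with the raw instruction; it assumes an x86 target and is not part of the measured example:

     /* A minimal sketch, assuming an x86 target: request a prefetch with the
      * raw prefetcht0 instruction the question mentions (fetch into all cache
      * levels). The portable fallback uses the builtin with its highest
      * locality hint, which is how GCC typically lowers prefetcht0 on x86
      * (an assumption about the target, not taken from the answer above). */
     static inline void prefetch_t0(const void *addr) {
     #if defined(__x86_64__) || defined(__i386__)
         __asm__ __volatile__("prefetcht0 %0" : : "m" (*(const char *)addr));
     #else
         __builtin_prefetch(addr, 0, 3);   /* read prefetch, high locality */
     #endif
     }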

    When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:

     $ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
     $ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
    
     $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch

      Performance counter stats for './with-prefetch':

         356,675,702      L1-dcache-load-misses     #   41.39% of all L1-dcache hits
         861,807,382      L1-dcache-loads

         8.787467487 seconds time elapsed

     $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch

      Performance counter stats for './no-prefetch':

         382,423,177      L1-dcache-load-misses     #   97.36% of all L1-dcache hits
         392,799,791      L1-dcache-loads

        11.376439030 seconds time elapsed
    

    Notice that we are doing roughly twice as many L1 cache loads in the prefetch version. We are actually doing a lot more work, but the memory access pattern is far friendlier to the pipeline. This also shows the trade-off: while this block of code runs faster in isolation, we have loaded a lot of junk into the caches, and that may put more pressure on other parts of the application.
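
    One way to reduce that pressure, sketched below but not benchmarked in this answer, is to lower the builtin's locality argument to 0, which tells the hardware the prefetched data has no temporal locality (on x86 GCC typically emits prefetchnta for this, an assumption about the target). Whether it helps or hurts depends on the surrounding workload and has to be measured:

     /* A sketch, not measured above: the same two prefetches as in the search
      * loop, but with locality 0 so the prefetched lines are kept out of the
      * caches as much as possible once they have been used. The parameters
      * mirror those of binarySearch. */
     static inline void prefetch_next_midpoints_nta(const int *array,
                                                     int low, int mid, int high) {
         __builtin_prefetch(&array[(mid + 1 + high) / 2], 0, 0);
         __builtin_prefetch(&array[(low + mid - 1) / 2], 0, 0);
     }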
