Prefetching Examples?

南旧 2020-11-27 10:44

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantia

5 answers
  •  暖寄归人
    2020-11-27 11:41

    I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost. The objective of this answer is to present a (I must confess, somewhat artificial) example where prefetching has a bigger impact (about a factor of 4 on my machine).

    There are three possible bottlenecks on modern architectures: CPU speed, memory bandwidth, and memory latency. Prefetching is all about reducing the latency of memory accesses.

    In a perfect scenario, where the latency corresponds to X calculation steps, we would have an oracle telling us which memory we will access X calculation steps from now. The prefetch of this data would be launched, and it would arrive just in time, X calculation steps later.

    For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks do a very good job here, concealing the latency almost completely.

    That is the reason why there is such a modest improvement for @Mystical's example: the hardware prefetcher is already pretty good, so there is just not much room for improvement. The task is also memory-bound, so probably not much bandwidth is left, and bandwidth could become the limiting factor. I could see at best around 8% improvement on my machine.
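
    To make the "simple for-loop" case concrete, here is a minimal sketch of a sequential sum with and without an explicit prefetch. It is not @Mystical's code; the function names and the prefetch distance of 16 elements are assumptions chosen for illustration. Whether the second version is any faster depends on the machine, and for a pattern this predictable it often is not.

```cpp
#include <cstddef>
#include <vector>

// Plain sequential sum: the access pattern is trivially predictable,
// so the hardware prefetcher can usually hide the latency on its own.
long long sum_plain(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Hypothetical tuning value (in elements), not a recommendation.
constexpr std::size_t PREFETCH_DISTANCE = 16;

// Same sum, but with a software prefetch a fixed distance ahead.
long long sum_with_prefetch(const std::vector<int>& v) {
    long long s = 0;
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            // rw = 0 (read), locality = 3 (keep in all cache levels)
            __builtin_prefetch(&v[i + PREFETCH_DISTANCE], 0, 3);
        }
        s += v[i];
    }
    return s;
}
```

    Both functions return the same sum; only the timing can differ, which is why such loops leave so little room for manual prefetching.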

    The crucial insight from the @JamesScriven example: neither we nor the CPU knows the next access address before the current data is fetched from memory. This dependency is pretty important; otherwise out-of-order execution would look ahead and the hardware would be able to prefetch the data. However, because we can speculate about only one step, there is not that much potential. I was not able to get more than 40% on my machine.
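
    One way to picture "speculating about one step" is a binary search that requests both possible next midpoints before the current comparison is resolved. The following is only a sketch of that idea, not @JamesScriven's code; the function name and the locality hint of 1 are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of key in the sorted vector a, or -1 if it is absent.
long binary_search_prefetch(const std::vector<int>& a, int key) {
    const int* base = a.data();
    std::size_t lo = 0, hi = a.size();          // search range is [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;

        // The next midpoint depends on a comparison with a[mid], which may
        // still be on its way from memory. Request both candidates now.
        // Pointer arithmetic on base keeps the address expression valid
        // even when the right candidate is one past the last element.
        __builtin_prefetch(base + lo + (mid - lo) / 2, 0, 1);           // if we go left
        __builtin_prefetch(base + mid + 1 + (hi - mid - 1) / 2, 0, 1);  // if we go right

        if (a[mid] == key) return static_cast<long>(mid);
        if (a[mid] < key) lo = mid + 1;
        else              hi = mid;
    }
    return -1;
}
```

    Only one level of the search can be anticipated this way, because the addresses two levels down depend on two unresolved comparisons. That matches the limited gain described above.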

    So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):

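    //making random accesses to memory:
    unsigned int next(unsigned int current){
       return (current*10001+328)%SIZE;
    }

    //the actual work is happening here
    void operator()(){

        //set up the oracle - let it see oracle_offset steps into the future
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);

        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //use the oracle: prefetch the block needed oracle_offset iterations from now
            if(prefetch)  //hint arguments (0,1) are an assumption; the source is cut off here
                __builtin_prefetch(mem.data()+prefetch_index,0,1);
            result+=mem[index];                  //actual work, the less the better
            prefetch_index=next(prefetch_index); //update the oracle
            index=next(mem[index]);              //depends on mem[index], so the hardware cannot look ahead
        }
    }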

    Some remarks:

    1. The data is prepared in such a way that the oracle is always right.
    2. Maybe surprisingly, the less CPU-bound the task, the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is (CPU-time + original-latency-time)/CPU-time.
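
    Remark 1 can be checked mechanically. The sketch below assumes the data is the identity mapping (mem[i] == i), so that next(mem[index]) equals next(index). The helper oracle_is_right is hypothetical and not part of the original program.

```cpp
#include <cstddef>
#include <vector>

const unsigned int SIZE = 1024 * 1024;

unsigned int next(unsigned int current) {
    return (current * 10001 + 328) % SIZE;
}

// Returns true if, for `steps` iterations, the oracle index equals the index
// that the data-dependent walk reaches `offset` steps later.
bool oracle_is_right(const std::vector<unsigned int>& mem, int offset, int steps) {
    // Record the data-dependent walk: index -> next(mem[index]).
    std::vector<unsigned int> visited;
    unsigned int index = 0;
    for (int i = 0; i < steps + offset; ++i) {
        visited.push_back(index);
        index = next(mem[index]);
    }

    // The oracle never touches memory: p -> next(p), started `offset` steps ahead.
    unsigned int oracle = 0;
    for (int i = 0; i < offset; ++i)
        oracle = next(oracle);

    for (int i = 0; i < steps; ++i) {
        if (oracle != visited[i + offset]) return false;
        oracle = next(oracle);
    }
    return true;
}
```

    With identity data the check succeeds for any offset; with data that changes the walk (for example mem[i] == i+1) the oracle is wrong from the first step, and the prefetches would fetch useless cache lines.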

    Compiling and running gives a speed-up between 4 and 5:

    >>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
    >>> ./prefetch_demo
    #preloops   time no prefetch      time prefetch         factor
    ...
    7           1.0711102260000001    0.230566831           4.6455521002498408
    8           1.0511602149999999    0.22651144600000001   4.6406494398521474
    9           1.049024333           0.22841439299999999   4.5926367389641687
    ....
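
    As a rough sanity check of remark 2 against these numbers, assume an idealized model in which every step costs a compute time T_cpu plus a memory latency T_lat, and in which prefetching hides the latency completely. The symbols and the model are assumptions for illustration:

```latex
\[
S \;=\; \frac{T_{\mathrm{cpu}} + T_{\mathrm{lat}}}{T_{\mathrm{cpu}}}
  \;=\; 1 + \frac{T_{\mathrm{lat}}}{T_{\mathrm{cpu}}},
\qquad
S \approx 4.65
\;\Longrightarrow\;
\frac{T_{\mathrm{lat}}}{T_{\mathrm{cpu}}} \approx 3.65 .
\]
```

    Under this model the latency per step is roughly 3.6 times the compute time per step on that machine. If the latency is not hidden completely, the true ratio is somewhat larger.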


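    Listing of prefetch_demo.cpp:

    //prefetch_demo.cpp

    #include <vector>
    #include <iostream>
    #include <iomanip>
    #include <chrono>

    const int SIZE=1024*1024*1;
    const int STEP_CNT=1024*1024*10;

    //pseudo-random walk through the indices [0, SIZE)
    unsigned int next(unsigned int current){
       return (current*10001+328)%SIZE;
    }


    template<bool prefetch>
    struct Worker{
       std::vector<int> mem;  //element type int is an assumption; it was lost in the source

       double result;
       int oracle_offset;

       void operator()(){
            //let the oracle run oracle_offset steps ahead
            unsigned int prefetch_index=0;
            for(int i=0;i<oracle_offset;i++)
                prefetch_index=next(prefetch_index);

            unsigned int index=0;
            for(int i=0;i<STEP_CNT;i++){
                //prefetch the memory block used in a future iteration
                //(hint arguments 0,1 are an assumption; the source is cut off here)
                if(prefetch){
                    __builtin_prefetch(mem.data()+prefetch_index,0,1);
                }
                //actual work:
                result+=mem[index];

                //prepare the next iteration
                prefetch_index=next(prefetch_index);
                index=next(mem[index]);
            }
       }

       Worker(std::vector<int> &mem_):
           mem(mem_), result(0.0), oracle_offset(0)
       {}
    };

    template <typename Worker>
        double timeit(Worker &worker){
        auto begin = std::chrono::high_resolution_clock::now();
        worker();
        auto end = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
    }


     int main() {
         //set up the data in special way!
         //keys[i]=i keeps the oracle right: next(mem[index])==next(index)
         std::vector<int> keys(SIZE);
         for (int i=0;i<SIZE;i++){
           keys[i] = i;
         }

         Worker<false> without_prefetch(keys);
         Worker<true> with_prefetch(keys);

         std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
         std::cout<<std::setprecision(17);

         //the upper bound 20 is an assumption; the source is cut off here
         for(int i=0;i<20;i++){
             //let the oracle see i steps into the future:
             without_prefetch.oracle_offset=i;
             with_prefetch.oracle_offset=i;

             double time_with_prefetch=timeit(with_prefetch);
             double time_no_prefetch=timeit(without_prefetch);

             std::cout<<i<<"\t"
                      <<time_no_prefetch<<"\t"
                      <<time_with_prefetch<<"\t"
                      <<(time_no_prefetch/time_with_prefetch)<<"\n";
         }
     }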
