Branch Predictor Entries Invalidation upon program finishes?

问题

I am trying to understand when branch predictor entries are invalidated.

Here are the experiments I have done:

Code1:

start_measure_branch_mispred()
while(X times):
 if(something something):
  do_useless()
 endif
endwhile
end_measurement()
store_difference()

So, I am running this code a number of times. I can see that after the first run, the misprediction rates go lower. The branch predictor learns how to predict correctly. But, if I run this experiment again and again (i.e. by writing ./experiment to the terminal), all the first iterations are starting from high misprediction rates. So, at each execution, the branch prediction units for those conditional branches are invalidated. I am using nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have run this experiment a couple of times to make sure this is the behavior (i.e. not because of the noise).

My question is: Does CPU invalidate branch prediction units after the program stops its execution? Or what is the cause of this?

The second experiment I have done is:

Code 2:

do:
    start_measure_branch_mispred()
    while(X times):
      if(something something):
        do_useless()
      endif
    endwhile
    end_measurement()
    store_difference()
while(cpu core == 1)

In this experiment, I am running the different processes from two different terminals. The first one is pinned to the core 1 so that it will run on the core 1 and it will do this experiment until I stop it (by killing it). Then, I am running the second process from another terminal and I am pinning the process to different cores. As this process is in a different core, it will only execute the do-while loop 1 time. If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly. If I pin the second process another core which is not the sibling of the first one, then the first iteration of the second process makes higher mispredictions. This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits the trained branch prediction units as they have the same virtual address and map to the same branch prediction unit entry.

As far as I understand, since the CPU is not done with the first process (core 1 process that does the busy loop), the branch prediction entries are still there and the second process can benefit from this. But, in the first one, from run to run, I get higher mispredictions.

EDIT: As the other user asked for the code, here it is. You need to download performance events header code from here

To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment

The code:

#include "linux-perf-events.h"

#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// some array
int arr8[8] = {1,1,0,0,0,1,0,1};

int pin_thread_to_core(int core_id){            
    int retval;     
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);      
    if (core_id < 0 || core_id >= num_cores)            
        retval = EINVAL;                                
    cpu_set_t cpuset;                                   
    CPU_ZERO(&cpuset);                                  
    CPU_SET(core_id, &cpuset);                          
    retval = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    return retval;
}

void measurement(int cpuid, uint64_t howmany, int* branch_misses){

    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(errno));
        return;
    }

    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!

    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid

    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org; // for restoring
    buffer_org = buffer;
    uint64_t howmany_org = howmany; // for restoring

    std::vector<unsigned long long> results;
    results.resize(evts.size());

    do{
        for(size_t trial = 0; trial < 10; trial++) {

            unified.start();
            // the while loop will be executed innerloop times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }       
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does busy loop

    // get rid of optimization
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation

    delete[] buffer;

}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
    // as I have 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }

    int exp = 16; // howmany

    int results[11];
    int cpuid = atoi(argv[1]); 

    measurement(cpuid, exp, results);

    printf("%d measurements\n", exp);

    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}

If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.

If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuid's.

Note that, you might not have the same performance event counter. Also, note that you might need to change the cpuid in the busyloop.

回答1:

So, I have conducted more experiments to reduce the effect of noise (either from _start until main() functions or from syscalls and interrupts that can happen between two program execution which (syscalls and interrupts) can corrupt the branch predictors.

Here is the pseudo-code of the modified experiment:

int main(int arg){ // arg is the iteration
   pin_thread_to_isolated_core()
   for i=0 to arg:
     measurement()
     std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
   endfor
   printresults() // print after all measurements are completed
}

void measurement(){
   initialization()
   for i=0 to 10:
      start_measurement()
      while(X times) // for the results below, X is 32
        a = arr8[an element] //sequence of 8,
        if(a is odd)
           do_sth()
        endif
      endwhile
      end_measurement()
      store_difference()
   endfor
}

And, these are the results:

For example, I give iteration as 3

Trial           BranchMiss
RUN:1
    0           16
    1           28
    2           3
    3           1
    ....  continues as 1
RUN:2
    0           16   // CPU forgets the sequence
    1           30
    2           2
    3           1
    ....  continues as 1
RUN:3
    0           16
    1           27
    2           4
    3           1
    ....  continues as 1

So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU can correctly guess, i.e. the Run2 and Run3 will look like below:

RUN:2
    0           1   
    1           1
    ....  continues as 1
RUN:3
    0           1
    1           1
    ....  continues as 1

I believe I diminish the branch executions from _start to the measurement point. Still, the CPU forgets the trained thing.

回答2:

Does CPU invalidate branch prediction units after the program stops its execution?

No, the CPU has no idea if/when a program stops execution.

The branch prediction data only makes sense for one virtual address space, so when you switch to a different virtual address space (or when the kernel switches to a different address space, rips the old virtual address space apart and converts its page tables, etc. back into free RAM, then constructs an entirely new virtual address space when you start the program again) all of the old branch predictor data is no longer valid for the new (completely different and unrelated, even if the contents happen to be the same) virtual address space.

If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.

This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).

In a perfect world; a glaring security vulnerability (branch predictor state, that can be used to infer information about the data that caused it, being leaked from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.

The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.

Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).

回答3:

TL:DR: power-saving deep sleep states clear branch-predictor history. Limiting sleep level to C3 preserves it on Broadwell.

For branch history to be useful across runs, it also helps to disable ASLR (so virtual addresses are the same), for example with a non-PIE executable.

Also, isolate the process on a single core because branch predictor entries are local to a physical core on Intel CPUs. Core isolation is not really absolutely necessary, though. If you run the program for many times consecutively on a mostly idle system, you'll find that sometimes it works, but not always. Basically, any task that happens to run on the same core, even for a short time, can pollute the branch predictor state. So running on an isolated core helps getting more stable results, especially on a busy system.

There are several factors that impact the measured number of branch mispredictions, but it's possible to isolate them from one another to determine what is causing these mispredictions. I need to introduce some terminology and my experimental setup first before discussing the details.

I'll use the version of the code from the answer you've posted, which is more general than the one shown in the question. The following code shows the most important parts:

void measurement(int cpuid, uint64_t howmany, int* branch_misses) {
    ...
        for(size_t trial = 0; trial < 4; trial++) {

            unified.start();
            int res;
            for(uint64_t tmp = howmany; tmp; tmp--) {
                res = arr8[tmp & 0x7];
                if(res){
                    *buffer++ = res;
                }
            }
            unified.end(results);
            ...
        }
    ...
}

int main(int argc, char *argv[]) {
    ...
    for(int i = 0; i < 3; ++i) {
        measurement(cpuid, exp, results);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    ...
}

A single execution of this program performs multiple sets of measurements of the number of branch mispredictions (the event BR_MISP_RETIRED.ALL_BRANCHES on Intel processors) of the while loop in the measurement function. Each set of measurements is followed by a call to sleep_for() to sleep for 1ms. Measurements within the same set are only separated by calls to unified.start() and unified.end(), which internally perform transitions to kernel mode and back to user mode. I've experimentally determined that it's sufficient for the number of measurements within a set to be 4 and the number of sets to be 3 because the number of branch mispredictions doesn't change beyond that. In addition, the exact location of the call to pin_thread_to_core in the code doesn't seem to be important, which indicates that there is no pollution from the code that surrounds the region of interest.

In all my experiments, I've compiled the code using gcc 7.4.0 -O0 and run it natively on a system with Linux 4.15.0 and an Intel Broadwell processor with hyperthreading disabled. As I'll discuss later, it's important to see what kinds of branches there are in the region of interest (i.e., the code for which the number of branch mispredictions is being measured). Since you've limited the event count to only user-mode events (by setting perf_event_attr.exclude_kernel to 1), you only to consider the user-mode code. But using the -O0 optimization level and C++ makes the native code a little ugly.

The unified.start() function contains two calls to ioctl()but user-mode event are measured only after returning from the second call. Starting from that location in unified.start(), there are a bunch of calls to PLTs (which contain only unconditional direct jumps), a few direct jumps, and a ret at the end. The while loop is implemented as a couple of conditional and unconditional direct jumps. Then there is a call to unified.end(), which calls ioctl to transition to kernel-mode and disable event counting. In the whole region of interest, there are no indirect branches other than a single ret. Any ret or a conditional jump instruction may generate a branch misprediction event. Indirect jumps and calls can also generate misprediction events had they existed. It's important to know this because an active Spectre v2 mitigation can change the state of the buffer used for predicting indirect branches other than rets (called BTB). According to the kernel log, the following Spectre mitigations are used on the system:

Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Spectre V2 : Enabling Restricted Speculation for firmware calls
Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier

The experimental setup described above is the baseline setup. Some of the experiments discussed below use additional compilation options or kernel parameters. First, I've use the intel_idle.max_cstate to limit the deepest Core C-state that the kernel can use. Broadwell supports the following Core C-states: C0, C1, C1E, C3, C6, and C7. I needed to only use to two max_cstate values, namely 3 and 6 so that the kernel doesn't use Core C-states below C3 and C6, respectively. Some experiments were run on a core isolated with the isolcpus kernel parameter. Finally, some experiments use code compiled with the -no-pie option, which disables PIE. All other kernel parameters have the default values. In particular, CPU vulnerability mitigations are always enabled.

The following figure shows the number of mispredictions measured in different configurations. I've followed the following experimental methodology:

Configure the system as required for the experiment to be conducted. Then the system is restarted so that the state of the branch prediction buffers is the same as the one used for other experiments.
The program is run ten consecutive times on the terminal. If isolcpus is used in the configuration, the program is always run on the isolated core.
There are three sets of four measurements in each of the ten runs. The four measurements of the first set of the first run are not shown in the figure because the numbers are the practically same in all configurations. They are basically 15, 6, 3, and 2 mispredictions. These are the training runs for the branch predictor, so it's expected that the number of misprediction will be high for the first measurment and that it will decrease in later measurement as the branch predictor learns. Increasing the number of measurements in the same set doesn't reduce the number of mispredictions any further. The rest of the measurements are plotted in the figure. The 12 bars of each configuration correspond to the 12 measurements performed in a single run in the same order. The numbers are averaged over the ten runs (except that the numbers of the first set of the first run are not included in the average in the first four bars). The label sXmY in the figure refers to the average number of mispredictions over the ten runs for the measurement Y of the set X.

The first configuration is essentially equivalent to the default. The first measurement of the first set indicates whether the branch predictor has retained what it has learned in the previous run of the experiment. The first measurements of the other two sets indicate whether the branch predictor has retained what it has learned in the previous set of measurements in the same run despite of the call to sleep_for. It's clear that the branch predictor has failed to retain this information in both cases in the first configuration. This is also the case in the next three configurations. In all of these configurations, intel_idle.max_cstate is set to 6, meaning that the cpuidle subsystem can choose to put a core into C6 when it has an empty runqueue. This is expected because C6 is power-gating state.

In the fifth configuration, intel_idle.max_cstate is set to 3, meaning that the deepest C-state the kernel is allowed to use is C3, which is a clock-gating state. The results indicate that the branch predictor can now retain its information across calls to sleep_for. Using a tool like strace, you can confirm that sleep_for always invokes the nanosleep system call irrespective of intel_idle.max_cstate. This means that user-kernel transitions cannot be the reason for polluting the branch prediction history in the previous configurations and that the C-state must be the influencing factor here.

Broadwell supports automatic promotion and demotion of C-states, which means that the hardware itself can change the C-state to something different than what the kernel has requested. The results may be a little perturbed if these features are not disabled, but I didn't find this to be an issue. I've observed that the number of cycles spent in C3 or C6 (depending on intel_idle.max_cstate) increases with the number of sets of measurements.

In the fifth configuration, the first bar is as high as in the previous configurations though. So the branch predictor is still not able to remember what it has learned in the first run. The sixth and seventh configurations are similar.

In the eighth configuration, the first bar is significantly lower than in the earlier configurations, which indicates that the branch predictor can now benefit from what it has learned in a previous run of the same program. This is achieved by using two configuration options in addition to setting intel_idle.max_cstate to 3: disabling PIE and running on an isolated core. Although it's not clear from the graph, both options are required. The kernel can randomize the base address of PIE binaries, which changes addresses of all branch instructions. This makes it more likely that the same static branch instructions to map to different branch buffer entries than in the previous run. So what the branch predictor has learned in the previous run is still there in its buffers, but it cannot utilize this information anymore because the linear addresses of the branches have changed. The fact that running on an isolated core is necessary indicates that it's common for the kernel to run short tasks on idle cores, which pollute the branch predictor state.

The first four bars of the eight configuration show that the branch predictor is still learning about one or two branch instructions that are in the region of interest. Actually, all of the remaining branch mispredictions are not for branches in the while loop. To show, the experiments can be repeated on the same code but without the while loop (i.e., there is nothing between unified.start() and unified.end()). This is the ninth configuration. Observe how the number of mispredictions is about the same.

The first bar is still a little higher than the others. Also it seems that there are branches that the branch predictor is having a hard time predicting. The tenth configuration takes -no-pie one step further and disables ASLR completely. This makes the first bar about equal to the others, but doesn't get rid of the two mispredictions. perf record -e cpu/branch-misses/uppp -c 1 can be used to find out which branches are being mispredicted. It tells me that the only branch in the region of interest that is being mispredicted is a branch instruction in the PTL of ioctl. I'm not sure which two branches are being mispredicted and why.

Regarding sharing branch prediction entries between hyperthreads, we know that some of the buffers are shared. For example, we know from the Spectre attack that the BTB is shared between hyperthreads on at least some Intel processors. According to Intel:

As noted in descriptions of Indirect Branch Prediction and Intel® Hyper-Threading Technology (Intel® HT Technology)”, logical processors sharing a core may share indirect branch predictors, allowing one logical processor to control the predicted targets of indirect branches by another logical processor of the same core. . . .
Recall that indirect branch predictors are never shared across cores.

Your results also suggest that the BHT is shared. We also know that the RSB is not shared. In general, this is a design choice. These structures don't have to be like that.

来源：https://stackoverflow.com/questions/59143138/branch-predictor-entries-invalidation-upon-program-finishes

标签

performance

x86

branch-prediction

spectre