Using rdmsr/rdpmc for branch prediction accuracy

Submitted by 我与影子孤独终老i on 2021-01-27 04:21:27

Question


I am trying to understand how a branch prediction unit works in a CPU.

I have used PAPI and also Linux's perf_events, but neither gives accurate results (for my case).

This is my code:

void func(int* arr, int sequence_len){
  for(int i = 0; i < sequence_len; i++){
      // region starts
      if(arr[i]){
          do_sth();
      }
      // region ends
  }
}

My array consists of 0's and 1's. It follows a fixed pattern whose length equals sequence_len. For example, if the length is 8, the pattern might be 0 1 0 1 0 0 1 1.

Trial 1:

I am trying to understand how the CPU predicts those branches. So, I have used PAPI and set up a performance counter for mispredicted branches (I know that it also counts indirect branches).

int func(){
  papi_read(r1);
  for(){
    //... same as above
  }
  papi_read(r2);
  return r2-r1;
}

int main(){
   init_papi();
   for(int i = 0; i < 10; i++)
     res[i] = func();

   for(int i = 0; i < 10; i++)
     print(res[i]);
}

What I see as an output is that (for sequence length of 200)

100 #iter1
40  #iter2
10  #iter3
3
0
0
#...

So, at first, the CPU predicts blindly, succeeding only half of the time. In later iterations, it predicts better and better, and after some number of iterations it can predict the sequence perfectly.

Trial 2

I would like to see at which array index the CPU mispredicts.

int* func(){
  int* res;
  for(){
    papi_read(r1);
    if(arr[i])
        do_sth();
    papi_read(r2);
    res[i] = r2-r1;
  }
  return res;
}

int main(){
   init_papi();
   int* res;
   for(int i = 0; i < 10; i++){
     res = func();
     print(res);
   }
}

Expected result:

#1st iteration, 0 means no mispred, 1 means mispred
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
0 0 0 0 1 0 0 0 1 0 0... # total of 200 results
Mispred: 40/200 # it learned from previous iteration
#3rd iteration
0 0 0 0 0 0 0 0 1 0 0... # total of 200 results
Mispred: 10/200 # continues to learn
#...

Received result:

#1st iteration
1 0 0 1 1 0 0 0 1 1 0... # total of 200 results
Mispred: 100/200
#2nd iteration
1 0 0 0 1 1 0 1 0 0 0... # total of 200 results
Mispred: 100/200 # it DID NOT learn from previous iteration
#3rd iteration
0 1 0 1 0 1 0 1 1 0 0... # total of 200 results
Mispred: 100/200 # NO LEARNING
#...

My observation

When I measure the mispredictions outside of the for loop, I can see that the CPU learns from its mispredictions. However, when I try to measure a single branch instruction's misprediction, the CPU either cannot learn, or I am measuring it wrongly.

My explanation

I am giving 200 as the sequence length. The CPU has a small per-branch predictor, like the 2-3 bit saturating counters in Intel CPUs, and a big global branch predictor. When I measure outside of the loop, I introduce less noise into the measurement. By less noise, I mean the papi calls.

Think about the outside-of-the-loop measurement:

the global history is: papi_start, branch_outcome1, branch_outcome2, branch_outcome3, ..., papi_end, papi_start (2nd iteration of the main loop), branch_outcome1, ...

So, the branch predictor somehow finds the pattern in the same branch.

However, if I try to measure a single branch instruction, then the global history is: papi_start, branch_outcome1, papi_end, papi_start, branch_outcome2, papi_end, ...

So, I am introducing more and more branches into the global history. I assume the global history cannot hold that many branch entries, and therefore it cannot find any correlation/pattern for the desired if statement (branch).

As a result

I need to measure a single branch prediction outcome. I know that the CPU can learn the 200-entry pattern if I don't interleave too many papi calls. Looking at the papi call implementations, I have seen lots of for loops and if conditions.

That is why I need a better measurement. I have tried Linux perf_events, but it makes ioctl calls, which are system calls, so I pollute the global history with system calls; therefore, it is not a good measurement.

I have read about the rdpmc and rdmsr instructions, and I assume that since they are single instructions, I will not pollute the global history and can measure one branch instruction at a time.

However, I have no clue how I can do that. I have an AMD 3600 CPU. These are the links I found online, but I couldn't figure out how to use them. Besides that, am I missing something?

Intel rdpmc

AMD Performance manual


Answer 1:


You have assumed that the PAPI and/or perf_events code has a relatively light footprint. This is incorrect. If you change the performance counter event to something like "instructions retired" or "CPU cycles not halted", you will be able to see how much overhead this operation contains in your software environment. The details will depend on your OS version, but I expect the overhead to be in the hundreds of instructions/thousands of cycles because of the kernel crossing required to read the counters in perf_events (which is used by PAPI). The code path will certainly include its own branches.

If your kernel supports "User-Mode RDPMC" (CR4.PCE=1), you can read a performance counter with a single instruction. Examples are available in https://github.com/jdmccalpin/low-overhead-timers.

Even when limiting the measurement code to the native RDPMC instruction (and the surrounding code to save the results), measurements are disruptive to the processor pipeline. RDPMC is a microcoded instruction. On the Ryzen core, the instruction executes 20 micro-ops and has a throughput of one instruction per 20 cycles. (Ref: https://www.agner.org/optimize/instruction_tables.pdf)

Any measurements at fine granularities are challenging because the out-of-order capabilities of modern processors interact with the user code in ways that are poorly documented and difficult to anticipate. More notes on this topic (also relevant to AMD processors) are at http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/




Answer 2:


The perf_event_open() documentation describes how to correctly use rdpmc with events created via that interface. The approach described in @JohnDMcCalpin's answer also works, but it's based on programming the event control registers directly. Given a set of hardware events, figuring out how to schedule these events on the available hardware performance counters can be difficult. The perf_event subsystem handles this problem for you, which is a major advantage.

The perf_event subsystem supports rdpmc since Linux 3.4.

Starting with <linux/perf_event.h>, the following works:

  1. call perf_event_open() to prepare to read a counter with type = PERF_TYPE_HARDWARE and config = PERF_COUNT_HW_BRANCH_MISSES:

    struct perf_event_attr attr ;
    int perf_fd ;
    
    memset(&attr, 0, sizeof(attr)) ;
    
    attr.type   = PERF_TYPE_HARDWARE ;
    attr.config = PERF_COUNT_HW_BRANCH_MISSES ;
    attr.size   = sizeof(attr) ;      // for completeness
    attr.exclude_kernel = 1 ;         // count user-land events only
    
    perf_fd = (int)sys_perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) ;
                                      // this pid, any cpu, no group_fd
    

    where:

    static long
    sys_perf_event_open(struct perf_event_attr* attr,
                                  pid_t pid, int cpu, int group_fd, ulong flags)
    {
      return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags) ;
    }
    
  2. associate the perf_fd with a mmap page:

    struct perf_event_mmap_page* perf_mm ;
    
    perf_mm = mmap(NULL, page_size, PROT_READ, MAP_SHARED, perf_fd, 0) ;
    

    page_size can be 4096, for example. The first page of this mapping holds the perf_event_mmap_page metadata used below; a larger mapping would also hold samples. See the "Overflow handling" section of the documentation.

  3. to read the counter, combine some information in perf_mm with what the RDPMC instruction returns:

    uint64_t  offset, count ;
    uint32_t  lock, check, a, d, idx ;
    
    lock = perf_mm->lock ;
    do
      {
        check = lock ;
        __asm__ volatile("":::"memory") ;
        idx = perf_mm->index ;
        // Check that you're allowed to execute rdpmc, and that the event
        // is currently active (index is zero when the event is not
        // scheduled on a counter). Starting with Linux 3.12, use
        // cap_user_rdpmc.
        if (perf_mm->cap_user_rdpmc && idx) {
           // cap_user_rdpmc cannot change at this point, because no code
           // that executes here changes it. So it's safe to use.
           // RDPMC takes the hardware counter number, which is index - 1.
           __asm__ volatile("\t rdpmc\n" : "=a" (a), "=d" (d) : "c" (idx - 1)) ;
        }
        // In case of signed event counts, you also have to use pmc_width.
        // See the docs.
        offset = perf_mm->offset ;
        __asm__ volatile("":::"memory") ;
        lock = perf_mm->lock ;
      }
    while (lock != check) ;
    
    count = ((uint64_t)d << 32) + a ;
    if (perf_mm->pmc_width != 64)
      {
        // need to sign-extend the low perf_mm->pmc_width bits of count
      }
    count += offset ;

    If the thread is not interrupted between the "start" and "end" reads, then I think we can assume that the perf_mm fields will not change. But if it is interrupted, the kernel updates the perf_mm fields to account for any changes that affect the timing.

  4. Note: the overhead around the RDPMC instruction is not huge, but I am experimenting with stripping all this back to see whether I can use the RDPMC results directly, provided that perf_mm->lock does not change.



Source: https://stackoverflow.com/questions/60265006/using-rdmsr-rdpmc-for-branch-prediction-accuracy
