How are x86 uops scheduled, exactly?

此生再无相见时 submitted on 2019-11-26 16:32:35

Your questions are tough for a couple of reasons:

  1. The answer depends a lot on the microarchitecture of the processor which can vary significantly from generation to generation.
  2. These are fine-grained details which Intel doesn't generally release to the public.

Nevertheless, I'll try to answer...

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?

It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, Pentium II and Pentium III) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as the place to start scanning for ready uops to dispatch. It was only pseudo-FIFO, so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in the Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues, so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station, and I would hazard an educated guess that it used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.
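To make "oldest ready first" concrete, here is a minimal Python sketch of that selection policy. It is a toy model, not a description of any real scheduler: the `Uop` fields, the port numbers, and the assumption that every uop already has a port bound to it are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Uop:
    name: str
    age: int     # allocation order: a smaller value means older in program order
    port: int    # port the uop is bound to (assumed to be fixed before this step)
    ready: bool  # True once all source operands are available

def pick_oldest_ready(rs, ports):
    """For each port, select the oldest ready uop bound to that port."""
    picks = {}
    for port in ports:
        candidates = [u for u in rs if u.ready and u.port == port]
        if candidates:
            picks[port] = min(candidates, key=lambda u: u.age)
    return picks

# Hypothetical reservation station contents.
rs = [
    Uop("imul", age=0, port=1, ready=False),  # oldest, but still waiting on an input
    Uop("add_a", age=1, port=0, ready=True),
    Uop("add_b", age=2, port=0, ready=True),
    Uop("add_c", age=3, port=1, ready=True),
]
picks = pick_oldest_ready(rs, ports=[0, 1, 5, 6])
print({port: uop.name for port, uop in picks.items()})
```

In this model the not-yet-ready imul is skipped even though it is the oldest entry, so a younger uop bound to the same port is selected instead. A pseudo-FIFO scheme like the one described for P6 would start its scan from a moving pointer rather than always taking the minimum age.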

When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?

That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.
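The counter idea can be sketched in a few lines of Python. This is only an illustration of the general mechanism as described above, assuming a simple "pick the least-loaded allowed port" rule; the class name, method names, and the tie-break toward the lowest port number are hypothetical and are not taken from the patent.

```python
class PortBalancer:
    """Toy model: one counter per port, tracking uops currently in the RS."""

    def __init__(self, ports):
        self.in_flight = {p: 0 for p in ports}

    def assign(self, allowed):
        # Pick the allowed port with the fewest uops in flight.
        # Ties go to the lowest port number (an arbitrary choice for this sketch).
        port = min(allowed, key=lambda p: (self.in_flight[p], p))
        self.in_flight[port] += 1  # counter goes up when the uop enters the RS
        return port

    def leave(self, port):
        self.in_flight[port] -= 1  # counter goes down when the uop leaves the RS

b = PortBalancer(ports=[0, 1])
restricted = [b.assign([0]) for _ in range(2)]   # uops that can only use port 0
flexible = [b.assign([0, 1]) for _ in range(3)]  # uops that can use either port
b.leave(0)                                       # one port-0 uop dispatches
print(restricted, flexible, b.in_flight)
```

Here the two restricted uops raise the port 0 counter, which steers the first two flexible uops to port 1 before the counters even out.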

Naturally, this is just a heuristic and does not guarantee a perfect conflict-free schedule. However, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.

In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)

If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Since uops are inserted into the reservation station in order, oldest here does indeed refer to the time a uop entered the reservation station, i.e. oldest in program order.

By the way, I would take those IACA results with a grain of salt, as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you in how many cycles of your thread uops were dispatched to each of ports 0-7. Maybe you could leverage these to get a better understanding of your program?

BeeOnRope

Here's what I found on Skylake, coming at it from the angle that uops are assigned to ports at issue time (i.e., when they are issued to the RS), not at dispatch time (i.e., at the moment they are sent to execute). Previously, I had understood that the port decision was made at dispatch time.

I did a variety of tests which tried to isolate sequences of add operations that can go to p0156 and imul operations which go only to port 1. A typical test goes something like this:

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... many more mov instructions

mov eax, [edi]
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

Basically, there is a long lead-in of mov eax, [edi] instructions, which only issue on p23 and hence don't clog up the ports used by the instructions under test (I could have also used nop instructions, but the test would be a bit different since nops don't issue to the RS). This is followed by the "payload" section, here composed of 4 imul and 12 add, and then a lead-out section of more dummy mov instructions.
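Test bodies of this shape are tedious to write by hand, so here is a small Python sketch that generates one. The function name and the lead-in and lead-out lengths of 100 are assumptions made for illustration; the lengths actually used in the tests above are not stated.

```python
def build_test(payload, lead_in=100, lead_out=100, filler="mov eax, [edi]"):
    """Return assembly text: filler loads, then the payload, then more filler."""
    lines = [filler] * lead_in + list(payload) + [filler] * lead_out
    return "\n".join(lines)

# Payload from the example above: 4 imul followed by 3 groups of 4 add.
payload = (
    ["imul ebx, ebx, 1"] * 4
    + ["add r9, 1", "add r8, 1", "add ecx, 1", "add edx, 1"] * 3
)
asm = build_test(payload)
print(asm.splitlines()[98:106])  # the boundary between lead-in and payload
```

The generated text can then be pasted into (or included from) whatever loop harness is used to run the measurement.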

First, let's take a look at the patent that hayesti linked above, whose basic idea he describes: counters for each port that track the total number of uops assigned to the port, which are used to load balance the port assignments. Take a look at this table included in the patent description:

This table is used to pick between p0 or p1 for the 3 uops in an issue group for the 3-wide architecture discussed in the patent. Note that the behavior depends on the position of the uop in the group, and that there are 4 rules [1] based on the count, which spread the uops around in a logical way. In particular, the count needs to be at +/- 2 or greater before the whole group gets assigned the under-used port.

Let's see if we can observe the "position in issue group matters" behavior on Skylake. We use a payload of a single add like:

add edx, 1     ; position 0
mov eax, [edi]
mov eax, [edi]
mov eax, [edi]

... and we slide it around inside the 4-instruction chunk like:

mov eax, [edi]
add edx, 1      ; position 1
mov eax, [edi]
mov eax, [edi]

... and so on, testing all four positions within the issue group [2]. This shows the following, when the RS is full (of mov instructions) but with no port pressure on any of the relevant ports:

  • The first add instruction goes to p5 or p6, with the port selected usually alternating as the instruction is slid down (i.e., add instructions in even positions go to p5 and in odd positions go to p6).
  • The second add instruction also goes to p56 - whichever of the two the first one didn't go to.
  • After that further add instructions start to be balanced around p0156, with p5 and p6 usually ahead but with things fairly even overall (i.e., the gap between p56 and the other two ports doesn't grow).

Next, I took a look at what happens if I load up p1 with imul operations, then follow them with a bunch of add operations:

imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1
imul ebx, ebx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

add r9, 1
add r8, 1
add ecx, 1
add edx, 1

The results show that the scheduler handles this well: all of the imul got scheduled to p1 (as expected), and then none of the subsequent add instructions went to p1, being spread around p056 instead. So here the scheduling is working well.

Of course, when the situation is reversed, and the series of imul comes after the adds, p1 is loaded up with its share of adds before the imuls hit. That's a result of port assignment happening in order at issue time, since there is no mechanism to "look ahead" and see the imul when scheduling the adds.
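The order sensitivity can be reproduced with a toy model in Python. This sketch assumes a plain "least-loaded allowed port" rule with ties going to the lowest port number, and it never drains the counters; it is not the algorithm Skylake actually uses (the measurements above show p5 and p6 being favored first), only an illustration of why in-order assignment without lookahead behaves this way.

```python
def assign_in_order(uops, ports=(0, 1, 5, 6)):
    """Assign each uop to its least-loaded allowed port, strictly in program order."""
    load = {p: 0 for p in ports}
    placed = []
    for name, allowed in uops:
        # The decision uses only the uops seen so far; later uops are invisible.
        port = min(allowed, key=lambda p: (load[p], p))
        load[port] += 1
        placed.append((name, port))
    return placed, load

IMUL = ("imul", (1,))           # imul can only use p1
ADD = ("add", (0, 1, 5, 6))     # add can use any of p0156

_, imul_first = assign_in_order([IMUL] * 4 + [ADD] * 12)
_, adds_first = assign_in_order([ADD] * 12 + [IMUL] * 4)
print("imul first:", imul_first)
print("adds first:", adds_first)
```

With the imuls first, the model steers every add away from p1 and all four ports end up with equal load. With the adds first, p1 has already taken its share of adds by the time the imuls arrive, so it ends up with more than twice the load of the other ports.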

Overall the scheduler looks to do a good job in these test cases.

This doesn't, however, explain what happens in smaller, tighter loops like the following:

top:
    sub r9, 1
    sub r10, 1
    imul ebx, edx, 1
    dec ecx
    jnz top

Just like Example 4 in my question, this loop only fills p0 on ~30% of cycles, despite there being two sub instructions that should be able to go to p0 on every cycle. p1 and p6 are oversubscribed, each executing 1.24 uops for every iteration (1 is ideal). I wasn't able to pin down what distinguishes the examples that work well at the top of this answer from the bad loops, but there are still many ideas to try.

I did note that examples without instruction latency differences don't seem to suffer from this issue. For example, here's another 4-uop loop with "complex" port pressure:

top:
    sub r8, 1
    ror r11, 2
    bswap eax
    dec ecx
    jnz top

The uop map is as follows:

instr   p0 p1 p5 p6 
sub      X  X  X  X
ror      X        X
bswap       X  X   
dec/jnz           X
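The map above is small enough to check by brute force. The following Python sketch enumerates every way of giving each of the four uops its own port; the dictionary simply transcribes the table, and the function name is made up for this example.

```python
from itertools import product

# Allowed ports per uop, transcribed from the map above.
UOP_PORTS = {
    "sub": (0, 1, 5, 6),
    "ror": (0, 6),
    "bswap": (1, 5),
    "dec/jnz": (6,),
}

def conflict_free_assignments(uop_ports):
    """Return every assignment in which no two uops share a port."""
    names = list(uop_ports)
    result = []
    for combo in product(*(uop_ports[n] for n in names)):
        if len(set(combo)) == len(combo):  # all chosen ports are distinct
            result.append(dict(zip(names, combo)))
    return result

solutions = conflict_free_assignments(UOP_PORTS)
for s in solutions:
    print(s)
```

Only two assignments survive: dec/jnz takes p6, which forces ror onto p0, and sub and bswap split p1 and p5 between them in either order.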

Since dec/jnz can only use p6, ror is forced onto p0, so the sub must always go to p1 or p5 (sharing them with bswap) if things are to work out. They do:

Performance counter stats for './sched-test2' (2 runs):

   999,709,142      uops_dispatched_port_port_0                                     ( +-  0.00% )
   999,675,324      uops_dispatched_port_port_1                                     ( +-  0.00% )
   999,772,564      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,991,020      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,238,468      uops_issued_any                                               ( +-  0.00% )
 5,000,000,117      instructions:u            #    4.99  insns per cycle          ( +-  0.00% )
 1,001,268,722      cycles:u                                                      ( +-  0.00% )
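To read these counts as per-port throughput, divide each by the cycle count. A short Python sketch using the numbers from the output above (the dictionary keys are just shorthand labels):

```python
# Counter values copied from the perf output above.
counts = {
    "p0": 999_709_142,
    "p1": 999_675_324,
    "p5": 999_772_564,
    "p6": 1_000_991_020,
}
cycles = 1_001_268_722

utilization = {port: n / cycles for port, n in counts.items()}
for port, u in utilization.items():
    print(f"{port}: {u:.3f} uops/cycle")
```

Each of the four ports comes out just under one uop per cycle, which is the ideal for a 4-uop loop running at one iteration per cycle.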

So it seems that the issue may be related to instruction latencies (certainly, there are other differences between the examples). That's something that came up in this similar question.


[1] The table has 5 rules, but the rules for counts of 0 and -1 are identical.

[2] Of course, I can't be sure where the issue groups start and end, but regardless we test four different positions as we slide down four instructions (though the position labels could be wrong). I'm also not sure the maximum issue group size is 4, since earlier parts of the pipeline are wider, but I believe it is, and some testing seemed to show it was (loops with a multiple of 4 uops showed consistent scheduling behavior). In any case, the conclusions hold with different scheduling group sizes.
