Are load ops deallocated from the RS when they dispatch, complete or some other time?

爱一瞬间的悲伤 2021-01-14 02:43

On modern Intel x86, are load uops freed from the RS (Reservation Station) at the point they dispatch, or when they complete?

2 answers
  • 2021-01-14 03:23

    Just came across this question. Here is my attempt at an answer.

    Short Answer: I'm still uncertain about some parts, but based on measurements using various performance counters together with performance monitoring interrupts, it "looks like" the load uop is removed from the RS in the same cycle it is dispatched to the load ports, or at least very shortly afterwards.

    Details: A while ago I tried writing a kernel module which mimics the ideas here. The blog post linked describes the idea really well, so I won't explain it in detail. The main idea is to trigger a performance monitoring interrupt after a set number of cycles has elapsed, freeze all currently tracked counter values, store them, then reset and repeat. Doing this for 1, 2, ... n cycles gives some picture of what is going on micro-architecturally at cycle granularity. How accurate that picture is is a different story... The source for the kernel module I used for measuring can be found here.

    Long Answer: I profiled the following code using the kernel module mentioned above on an i7-1065G7 (Ice Lake) and tracked 11 different performance counters. Before the profiled mov instruction, clflush was called on the address stored in r8. This makes the load take long enough that it is easy to tell whether the uop was removed from the RS before, during or after execution (otherwise the load completes in about 4 cycles). In total I measured up to 600 cycles, with most of the events of interest for this question happening within the first 65 cycles. To account for noise I ran 1024 trials for each cycle and stored the counter value that occurred most often. Luckily, for each cycle and each counter in the table below, at most a single trial deviated, with the remaining 1023 trials giving the same counter value.

     563:   0f 30                   wrmsr  
     565:   4d 8b 00                mov    (%r8),%r8
     568:   0f ae f0                mfence 
     56b:   0f ae e8                lfence
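
    The denoising step described above (1024 trials per cycle, keeping the value that occurred most often) can be sketched as follows. This is only a minimal Python illustration, not code from the kernel module; the helper name `most_common_value` and the sample data are hypothetical.

```python
from collections import Counter

def most_common_value(trials):
    """Return (value, count) for the most frequent counter value in one cycle's trials."""
    value, count = Counter(trials).most_common(1)[0]
    return value, count

# Hypothetical sample: 1023 trials agree and a single trial deviates,
# which is the worst case described in the text.
trials = [46] * 1023 + [47]
value, count = most_common_value(trials)
print(f"stored value: {value} (seen in {count} of {len(trials)} trials)")
```

    Keeping the count alongside the value makes it easy to flag cycles where the majority is weak and the measurement should be repeated.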
    

    The counters tracked are listed below. Descriptions are summarized from Intel SDM.

      INST_RETIRED_ANY_P:          To track when wrmsr retired
      RS_EVENTS_EMPTY_CYCLES:      Count of cycles RS is empty
      UOPS_DISPATCHED_PORT_PORT_0: # uops dispatched to port 0
      UOPS_DISPATCHED_PORT_PORT_1: # uops dispatched to port 1 
      UOPS_DISPATCHED_PORT_2_3:    # uops dispatched to port 2,3 (load addr ports)
      UOPS_DISPATCHED_PORT_4_9:    # uops dispatched to port 4,9 (store data ports)
      UOPS_DISPATCHED_PORT_PORT_5: # uops dispatched to port 5
      UOPS_DISPATCHED_PORT_PORT_6: # uops dispatched to port 6
      UOPS_DISPATCHED_PORT_7_8:    # uops dispatched to port 7,8 (store addr ports)
      UOPS_EXECUTED_THREAD:        # uops executed
      UOPS_ISSUED_ANY:             # uops sent to RS from RAT
    

    The table below lists the cumulative value of each counter at each cycle. Based on the table, one uop is sent to the RS at cycle 47 (the issued count reads 7 from the cycle 48 row onward) and occupies the RS for cycles 51-54, where the RS-empty count stays at 46. This is presumably the load uop. At cycle 54, RS_EVENTS_EMPTY_CYCLES and UOPS_DISPATCHED_PORT_2_3 both increment (both read one higher in the cycle 55 row), which means, at least as I interpret it, that the load uop has been dispatched and freed from the RS.

    What I'm not sure about is that at cycle 52 three more uops are issued to the RS. They seem to arrive and occupy the RS for cycles 55-58, but only two uops are dispatched to the execution ports (store data and store address) before the RS empties. Regardless, by cycle 59 the RS is empty again (the count increases every cycle from then on). The load completes and the mov retires about 500 cycles later.

    +-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
    | Cycle | Inst Retired | Cycles RS Empty | Port 0 | Port 1 | Port 2,3 | Port 4,9 | Port 5 | Port 6 | Port 7,8 | uops executed | uops issued to RS |        Comments        |
    +-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
    |     1 |            0 |               3 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
    |     2 |            0 |               4 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
    |     3 |            0 |               5 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 0 |                        |
    |     4 |            0 |               6 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 | 2 uops issued          |
    |     5 |            0 |               7 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |     6 |            0 |               8 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |     7 |            0 |               9 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |     8 |            0 |              10 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |     9 |            0 |              11 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |    10 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |    11 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |    12 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |    13 |            0 |              12 |      0 |      0 |        0 |        0 |      0 |      0 |        0 |             3 |                 2 |                        |
    |    14 |            0 |              13 |      0 |      0 |        0 |        0 |      0 |      1 |        0 |             3 |                 2 |                        |
    |    15 |            0 |              14 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             3 |                 2 | 2 uops dispatched      |
    |    16 |            0 |              15 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             4 |                 2 |                        |
    |    17 |            0 |              16 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 | 2 uops executed        |
    |    18 |            0 |              17 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
    |    19 |            0 |              18 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
    |    20 |            0 |              19 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
    |    21 |            0 |              20 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
    |    22 |            0 |              21 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 2 |                        |
    |    23 |            0 |              22 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 5 |                        |
    |    24 |            0 |              23 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 | 4 uops issued          |
    |    25 |            0 |              24 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    26 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    27 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    28 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    29 |            0 |              25 |      0 |      0 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    30 |            0 |              25 |      0 |      1 |        0 |        0 |      0 |      2 |        0 |             5 |                 6 |                        |
    |    31 |            0 |              26 |      0 |      1 |        0 |        0 |      0 |      3 |        0 |             5 |                 6 |                        |
    |    32 |            0 |              27 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             6 |                 6 |                        |
    |    33 |            0 |              28 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             7 |                 6 |                        |
    |    34 |            0 |              29 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | 3 uops executed        |
    |    35 |            0 |              30 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    36 |            1 |              31 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 | wrmsr retired          |
    |    37 |            1 |              32 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    38 |            1 |              33 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    39 |            1 |              34 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    40 |            1 |              35 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    41 |            1 |              36 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    42 |            1 |              37 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    43 |            1 |              38 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    44 |            1 |              39 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    45 |            1 |              40 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    46 |            1 |              41 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    47 |            1 |              42 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 6 |                        |
    |    48 |            1 |              43 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 | 1 uop issued           |
    |    49 |            1 |              44 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
    |    50 |            1 |              45 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
    |    51 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                 7 |                        |
    |    52 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | 3 uops issued          |
    |    53 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
    |    54 |            1 |              46 |      0 |      1 |        0 |        0 |      0 |      4 |        0 |             8 |                10 | port 2,3 load addr     |
    |    55 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 |                        |
    |    56 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             8 |                10 | executing load         |
    |    57 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 |                        |
    |    58 |            1 |              47 |      0 |      1 |        1 |        0 |      0 |      4 |        0 |             9 |                10 | port 4,9 store data    |
    |    59 |            1 |              48 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 | port 7,8 store address |
    |    60 |            1 |              49 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |             9 |                10 |                        |
    |    61 |            1 |              50 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 | 2 uops executed        |
    |    62 |            1 |              51 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
    |    63 |            1 |              52 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
    |    64 |            1 |              53 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
    |    65 |            1 |              54 |      0 |      1 |        1 |        1 |      0 |      4 |        1 |            11 |                10 |                        |
    +-------+--------------+-----------------+--------+--------+----------+----------+--------+--------+----------+---------------+-------------------+------------------------+
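
    Because the counters are cumulative, one way to read the table is to difference consecutive rows. The sketch below does this in Python for rows 47-60, with the values copied from the table; the field and helper names are hypothetical, and the row in which a change shows up may lag the actual event by a cycle, so the result is an approximate reading.

```python
# Columns copied from the table: RS-empty cycles, port 2,3, port 4,9,
# port 7,8, uops executed, uops issued (all cumulative).
FIELDS = ("rs_empty", "port_2_3", "port_4_9", "port_7_8", "executed", "issued")
ROWS = {
    47: (42, 0, 0, 0, 8, 6),  48: (43, 0, 0, 0, 8, 7),
    49: (44, 0, 0, 0, 8, 7),  50: (45, 0, 0, 0, 8, 7),
    51: (46, 0, 0, 0, 8, 7),  52: (46, 0, 0, 0, 8, 10),
    53: (46, 0, 0, 0, 8, 10), 54: (46, 0, 0, 0, 8, 10),
    55: (47, 1, 0, 0, 8, 10), 56: (47, 1, 0, 0, 8, 10),
    57: (47, 1, 0, 0, 9, 10), 58: (47, 1, 0, 0, 9, 10),
    59: (48, 1, 1, 1, 9, 10), 60: (49, 1, 1, 1, 9, 10),
}

def deltas(rows):
    """Map each cycle to the per-field change relative to the previous row."""
    cycles = sorted(rows)
    return {
        cur: {f: c - p for f, p, c in zip(FIELDS, rows[prev], rows[cur])}
        for prev, cur in zip(cycles, cycles[1:])
    }

d = deltas(ROWS)
# Rows where the RS-empty counter did not advance, i.e. the RS held a uop.
rs_busy = [c for c, v in d.items() if v["rs_empty"] == 0]
# Rows where a uop was dispatched to the load ports.
load_dispatch = [c for c, v in d.items() if v["port_2_3"] > 0]
# Rows where uops were issued, with how many.
issued = {c: v["issued"] for c, v in d.items() if v["issued"] > 0}

print("RS non-empty rows:", rs_busy)
print("load dispatch rows:", load_dispatch)
print("issue rows:", issued)
```

    In this reading the RS-empty counter resumes in the same row in which the port 2,3 counter increments, which is what suggests that the load uop leaves the RS at or right after dispatch.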
    
    

    So based on the table it looks like the load uop is removed from the RS either at the same time as it is dispatched to the load port or a couple of cycles later. I did some sanity checking of the values in the table and for the most part the counter values make sense. Two things I haven't figured out: 4 uops are sent to the RS (cycle 24) but only 3 are executed (by cycle 34 in the table), and similarly 3 uops are issued at cycle 52 but only 2 are executed (cycle 61).

    Thanks

  • 2021-01-14 03:25

    The following experiments suggest that the uops are deallocated at some point before the load completes. While this is not a complete answer to your question, it might provide some interesting insights.

    On Skylake, there is a 33-entry reservation station for loads (see https://stackoverflow.com/a/58575898/10461973). This should also be the case for the Coffee Lake i7-8700K, which is used for the following experiments.

    We assume that R14 contains a valid memory address.

    clflush [R14]
    clflush [R14+512]
    mfence
    
    # start measuring cycles
    
    mov RAX, [R14]
    mov RAX, [R14]
    ...
    mov RAX, [R14]
    
    mov RBX, [R14+512]
    
    # stop measuring cycles
    
    

    mov RAX, [R14] is unrolled 35 times. A load from memory takes at least about 280 cycles on this system. If the load uops stayed in the 33-entry reservation station until completion, the last load could only start after more than 280 cycles and would then need another ~280 cycles, for a total of at least ~560 cycles. However, the total measured time for this experiment is only about 340 cycles. This indicates that the load uops leave the RS at some point before completion.
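
    The arithmetic behind this argument can be written out as a small Python sketch. It is a deliberately simplified model, not a simulation: it assumes all loads to the same line complete together after one memory latency, ignores issue bandwidth, and uses the numbers quoted in this answer (33 entries, ~280 cycles, ~340 cycles measured). The function name is hypothetical.

```python
RS_ENTRIES = 33        # RS entries available to loads (figure quoted above)
LOAD_LATENCY = 280     # approximate memory latency in cycles on this system
LOADS_BEFORE_LAST = 35 # unrolled "mov RAX, [R14]" instructions
MEASURED = 340         # approximate measured time in cycles

def bound_if_held_until_completion(rs_entries, loads_before_last, latency):
    """Lower bound on total time if load uops kept their RS entry until the data arrives."""
    if loads_before_last < rs_entries:
        # The final load fits in the RS immediately and overlaps with the others.
        return latency
    # The final load must wait for an entry to free up (one full latency),
    # then needs a full latency of its own.
    return 2 * latency

bound = bound_if_held_until_completion(RS_ENTRIES, LOADS_BEFORE_LAST, LOAD_LATENCY)
print(f"bound if held until completion: {bound} cycles, measured: {MEASURED} cycles")
print("consistent with holding entries until completion:", MEASURED >= bound)
```

    Under these assumptions the measured time falls well below the bound, which is the basis for concluding that the entries are freed earlier.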

    In contrast, the following experiment shows a case where most uops are forced to stay in the reservation station until the first load completes:

    mov RAX, R14
    mov [RAX], RAX
    clflush [R14]
    clflush [R14+512]
    mfence
    
    # start measuring cycles
    
    mov RAX, [RAX]
    mov RAX, [RAX]
    ...
    mov RAX, [RAX]
    
    mov RBX, [R14+512]
    
    # stop measuring cycles
    
    

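    The first 35 loads now depend on each other: the setup code stores the address in R14 at that same address, so each mov RAX, [RAX] reloads the same pointer and cannot be dispatched until the previous load has produced its result. The measured time for this experiment is about 600 cycles.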

    The experiments were performed with all but one core disabled, and with the CPU governor set to performance (cpupower frequency-set --governor performance).

    Here are the nanoBench commands I used:

    ./nanoBench.sh -unroll 1 -basic -asm_init "clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RAX, [R14]; mov RBX, [R14+512]"

    ./nanoBench.sh -unroll 1 -basic -asm_init "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence" -asm "mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RAX, [RAX]; mov RBX, [R14+512]"
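
    The long -asm strings above are just one instruction repeated 35 times plus a final load. A small Python sketch for generating them is shown below; it only uses the nanoBench options that appear in the commands above, and the helper name `nanobench_command` is hypothetical. Note that `shlex.quote` emits single quotes instead of double quotes, which is equivalent for the shell here.

```python
import shlex

def nanobench_command(asm_init, body, repeats, tail):
    """Build a nanoBench.sh command line with `body` repeated `repeats` times, then `tail`."""
    asm = "; ".join([body] * repeats + [tail])
    args = ["./nanoBench.sh", "-unroll", "1", "-basic",
            "-asm_init", asm_init, "-asm", asm]
    return " ".join(shlex.quote(a) for a in args)

# Independent loads: every load reads [R14].
independent = nanobench_command(
    "clflush [R14]; clflush [R14+512]; mfence",
    "mov RAX, [R14]", 35, "mov RBX, [R14+512]")

# Dependent loads: every load reads through the pointer returned by the previous one.
dependent = nanobench_command(
    "mov RAX, R14; mov [RAX], RAX; clflush [R14]; clflush [R14+512]; mfence",
    "mov RAX, [RAX]", 35, "mov RBX, [R14+512]")

print(independent)
print(dependent)
```

    Generating the strings this way also makes it easy to vary the repeat count around the RS size to see where the measured time changes.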
