Negative clock cycle measurements with back-to-back rdtsc?

后端 未结 9 1235
情话喂你
情话喂你 2020-11-27 04:17

I am writing a C code for measuring the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdt

9条回答
  •  长情又很酷
    2020-11-27 04:29

    The other answers are great (go read them), but assume that rdtsc being read correctly. This answer is addressing the inline-asm bug that leads to totally bogus results, including negative.

    The other possibility is that you were compiling this as 32-bit code, but with many more repeats, and got an occasional negative interval on CPU migration on a system that doesn't have invariant-TSC (synced TSCs across all cores). Either a multi-socket system, or an older multi-core. CPU TSC fetch operation especially in multicore-multi-processor environment.


    If you were compiling for x86-64, your negative results are fully explained by your incorrect "=A" output constraint for asm. See Get CPU cycle count? for correct ways to use rdtsc that are portable to all compilers and 32 vs. 64-bit mode. Or use "=a" and "=d" outputs and simply ignore the high half output, for short intervals that won't overflow 32 bits.)

    (I'm surprised you didn't mention them also being huge and wildly-varying, as well as overflowing tot to give a negative average even if no individual measurements were negative. I'm seeing averages like -63421899, or 69374170, or 115365476.)

    Compiling it with gcc -O3 -m32 makes it work as expected, printing averages of 24 to 26 (if run in a loop so the CPU stays at top speed, otherwise like 125 reference cycles for the 24 core clock cycles between back-to-back rdtsc on Skylake). https://agner.org/optimize/ for instruction tables.


    Asm details of what went wrong with the "=A" constraint

    rdtsc (insn ref manual entry) always produces the two 32-bit hi:lo halves of its 64-bit result in edx:eax, even in 64-bit mode where we're really rather have it in a single 64-bit register.

    You were expecting the "=A" output constraint to pick edx:eax for uint64_t t. But that's not what happens. For a variable that fits in one register, the compiler picks either RAX or RDX and assumes the other is unmodified, just like a "=r" constraint picks one register and assumes the rest are unmodified. Or an "=Q" constraint picks one of a,b,c, or d. (See x86 constraints).

    In x86-64, you'd normally only want "=A" for an unsigned __int128 operand, like a multiple result or div input. It's kind of a hack because using %0 in the asm template only expands to the low register, and there's no warning when "=A" doesn't use both a and d registers.

    To see exactly how this causes a problem, I added a comment inside the asm template:
    __asm__ volatile ("rdtsc # compiler picked %0" : "=A"(t));. So we can see what the compiler expects, based on what we told it with operands.

    The resulting loop (in Intel syntax) looks like this, from compiling a cleaned up version of your code on the Godbolt compiler explorer for 64-bit gcc and 32-bit clang:

    # the main loop from gcc -O3  targeting x86-64, my comments added
    .L6:
        rdtsc  # compiler picked rax     # c1 = rax
        rdtsc  # compiler picked rdx     # c2 = rdx, not realizing that rdtsc clobbers rax(c1)
    
          # compiler thinks   RAX=c1,               RDX=c2
          # actual situation: RAX=low half of c2,   RDX=high half of c2
    
        sub     edx, eax                 # tsccost = edx-eax
        js      .L3                      # jump if the sign-bit is set in tsccost
       ... rest of loop back to .L6
    

    When the compiler is calculating c2-c1, it's actually calculating hi-lo from the 2nd rdtsc, because we lied to the compiler about what the asm statement does. The 2nd rdtsc clobbered c1

    We told it that it had a choice of which register to get the output in, so it picked one register the first time, and the other the 2nd time, so it wouldn't need any mov instructions.

    The TSC counts reference cycles since the last reboot. But the code doesn't depend on hi, it just depends on the sign of hi-lo. Since lo wraps around every second or two (2^32 Hz is close to 4.3GHz), running the program at any given time has approximately a 50% chance of seeing a negative result.

    It doesn't depend on the current value of hi; there's maybe a 1 part in 2^32 bias in one direction or the other because hi changes by one when lo wraps around.

    Since hi-lo is a nearly uniformly distributed 32-bit integer, overflow of the average is very common. Your code is ok if the average is normally small. (But see other answers for why you don't want the mean; you want to median or something to exclude outliers.)

提交回复
热议问题