Question
I would like to implement a two-thread model where one thread counts (infinitely increments a value) and the other records the first counter value, does the job, records the second counter value, and measures the time elapsed in between.
Here is what I have done so far:
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;

void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal
    while(1){
        //counter++;                                   // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}

void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1, r2;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a *= 3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2 - r1;
        printf("counter: %lu\n", counter);
        break;
    }
}
Let me explain what I have done so far:
Since I want the counter to be accurate, I am setting the affinity to an isolated CPU. Also, if I use the plain increment in Line 1*, the disassembled function is:
d4c: 4c 89 e8 mov %r13,%rax
d4f: 48 83 c0 01 add $0x1,%rax
d53: 49 89 c5 mov %rax,%r13
d56: eb f4 jmp d4c <counter_thread+0x37>
This is not a 1-cycle operation, which is why I used inline assembly to get rid of the two mov instructions. Using the inline assembly:
d4c: 49 83 c5 01 add $0x1,%r13
d50: eb fa jmp d4c <counter_thread+0x37>
But the thing is, neither implementation works: the other thread cannot see the counter being updated. If I make the global counter a normal variable rather than a register, it works, but I lose the precision I want. If I declare the global counter as unsigned long counter, then the disassembled code of the counter thread is:
d4c: 48 8b 05 ed 12 20 00 mov 0x2012ed(%rip),%rax # 202040 <counter>
d53: 48 83 c0 01 add $0x1,%rax
d57: 48 89 05 e2 12 20 00 mov %rax,0x2012e2(%rip) # 202040 <counter>
d5e: eb ec jmp d4c <counter_thread+0x37>
It works but it doesn't give me the granularity that I want.
EDIT:
My environment:
- CPU: AMD Ryzen 3600
- kernel: 5.0.0-32-generic
- OS: Ubuntu 18.04
EDIT2: I have isolated 2 neighboring CPU cores (i.e. cores 10 and 11) and I am running the experiment on those cores. The counter runs on one core, the measurement on the other. Isolation is done by adding an isolcpus line to /etc/default/grub.
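For reference, the grub entry looks something like this (core numbers from my setup; run sudo update-grub and reboot afterwards):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=10,11"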
EDIT3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.
Experiment 1: Setup:
unsigned long counter = 0; // global counter

void* counter_thread(){
    mfence();
    while(1)
        counter++;
}

void* measurement_thread(){
    unsigned long i = 0, r1 = 0, r2 = 0;
    unsigned int a = 0;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a += 3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
}
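(measurements and MILLION_ITER are not shown here; assume declarations like these:)

#define MILLION_ITER 10000000UL
// Histogram of observed r2-r1 deltas. Sized generously; a stray
// context switch can make the delta arbitrarily large, so real code
// should clamp the index before incrementing.
unsigned long measurements[1 << 20];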
Results 1: In 99.99% of the iterations I got 0, which I expect, because either the first thread is not running, or the OS or other interrupts disturb the measurement. Discarding the 0s and the very high values gives me about 20 cycles per measurement on average. (I was expecting 3-4, because I only do an integer addition.)
Experiment2:
Setup: Identical to the one above, with one difference: instead of a plain global counter, I declare the counter as a register variable:
register unsigned long counter asm("r13");
Results 2: The measurement thread always reads 0. In the disassembled code I can see that both threads work on the R13 register (the counter); however, the register is apparently not shared between them.
Experiment3:
Setup: Identical to setup 2, except that in the counter thread, instead of doing counter++, I use inline assembly to make sure the increment is a single instruction. My disassembled file looks like this:
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
Results 3: The measurement thread reads 0, as above.
Answer 1:
Each thread has its own registers. Each logical CPU core has its own architectural registers, which a thread uses while running on that core; only signal handlers (or, on bare metal, interrupts) can asynchronously modify a thread's own registers.
Declaring a GNU C register-global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.
Only memory is shared between threads, not registers. This is how multiple threads can run at the same time without stepping on each other, each using its own registers.
Registers you don't declare as register-globals can be used freely by the compiler, so it wouldn't work at all for them to be shared between cores. (And there's nothing GCC can do to make registers shared vs. private depending on how you declare them.)
Even apart from that, the register-global isn't volatile or atomic, so r1 = counter; and r2 = counter; can CSE, making r2-r1 a compile-time-constant zero even if your local R13 were changing from a signal handler.
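If you do want a software counter that another thread can actually observe, it has to live in memory. A minimal sketch using a C11 relaxed atomic (my illustration, not the question's code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;   // shared: lives in memory, not a register

void* counter_thread(void* arg) {
    (void)arg;
    for (;;)   // relaxed is enough: we only need the value, not ordering
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

static inline uint64_t read_counter(void) {
    return atomic_load_explicit(&counter, memory_order_relaxed);
}

Each read from the other thread still costs an inter-core cache transfer, though, which is exactly the granularity problem discussed below.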
How can I make sure that both of the threads are using registers for read/write operation of the counter value?
You can't do that. There is no shared state between cores that can be read/written with lower latency than cache.
If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to count core clock cycles).
Using another thread to increment a counter is unnecessary, and not helpful because there's no very-low-overhead way to read something from another core.
The rdtscp instruction on my machine gives 36-72-108... cycle resolution at best. So I cannot distinguish between 2 cycles and 35 cycles, because both will read as 36 cycles.
Then you're using rdtsc wrong. It's not serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
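A minimal sketch of that fenced pattern (GNU C on x86-64; details are in the linked answer):

#include <stdint.h>

// lfence before: rdtsc can't execute until earlier instructions finish.
// lfence after: later instructions can't start until rdtsc has executed.
// (On AMD, lfence is only dispatch-serializing when the kernel sets the
// relevant MSR bit, which Spectre mitigations normally do.)
static inline uint64_t rdtsc_fenced(void) {
    uint32_t lo, hi;
    asm volatile("lfence\n\trdtsc\n\tlfence"
                 : "=a"(lo), "=d"(hi) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}

Note the result is in reference cycles, not core clock cycles, so it doesn't directly count instructions.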
But more importantly, you can't usefully measure a *=3; in C as a single cost in cycles. First of all, it can compile differently depending on context. But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/
See my answer on RDTSCP in NASM always returns the same value for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency (sketched after the list below), and look at perf counters to get uops->ports. Or look at Agner Fog's instruction tables and https://uops.info/, because people have already done those tests.
Also
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
- Modern x86 cost model
Again, these are how you time a single asm instruction, not a C statement. With optimization enabled the cost of a C statement can depend on how it optimizes into the surrounding code. (And/or whether latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs.)
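For instance, to separate latency from throughput you put the instruction in a loop two different ways and time each loop (a sketch, my code, using imul as the sample instruction):

#include <stdint.h>

enum { ITERS = 100000000 };

// Latency: every imul consumes the previous result, so the loop runs at
// one result per latency (3 cycles on most modern x86).
uint64_t imul_latency(void) {
    uint64_t x = 1;
    for (long i = 0; i < ITERS; i++)
        asm volatile("imul $3, %0, %0" : "+r"(x));
    return x;
}

// Throughput: four independent dependency chains can overlap, so cycles
// per imul approaches the reciprocal throughput instead.
uint64_t imul_throughput(void) {
    uint64_t a = 1, b = 1, c = 1, d = 1;
    for (long i = 0; i < ITERS; i++)
        asm volatile("imul $3, %0, %0\n\timul $3, %1, %1\n\t"
                     "imul $3, %2, %2\n\timul $3, %3, %3"
                     : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    return a + b + c + d;
}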
Answer 2:
Then you're using rdtsc wrong. It's not serializing so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
Ok. I did my homework.
First things first: I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; there is a P at the end.
I have checked both Intel and AMD manuals for that.
- Intel manual, page 83, Table 2-3: Summary of System Instructions
- AMD manual, pages 403-406
Correct me if I am wrong, but from what I read, I understand that I don't need fence instructions before and after rdtscp, because it is a serializing instruction, right?
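(For concreteness, a fenced rdtscp read looks like this; the manuals do say that while rdtscp waits for all earlier instructions to execute, later instructions may begin before it reads the counter, so a trailing fence is not obviously redundant:)

#include <stdint.h>

// rdtscp waits for earlier instructions to execute before reading the
// TSC, but later instructions may start before the read; the trailing
// lfence keeps the tail of the timed region from leaking past it.
static inline uint64_t rdtscp_fenced(void) {
    uint32_t lo, hi;
    asm volatile("rdtscp\n\tlfence"
                 : "=a"(lo), "=d"(hi) :: "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}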
Second thing is, I ran some experiments on 3 of my machines. Here are the results.
Ryzen experiments
======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Note that almost all sums are divisible by 36, which is my machine's timer resolution. I also checked the cases where the sums are not divisible by 36: that only happens when I don't use fence instructions with rdtsc. There the read value turns out to be 35 or 1, which I believe means rdtsc cannot read the value correctly.
Mfenced rtdscP reads:
Sum: 25884432
Avg: 258
Sum, removed outliers: 25800120
Avg, removed outliers: 258
Mfenced rtdsc reads:
Sum: 17579196
Avg: 175
Sum, removed outliers: 17577684
Avg, removed outliers: 175
Lfenced rtdscP reads:
Sum: 7511688
Avg: 75
Sum, removed outliers: 7501608
Avg, removed outliers: 75
Lfenced rtdsc reads:
Sum: 7024428
Avg: 70
Sum, removed outliers: 7015248
Avg, removed outliers: 70
NOT fenced rtdscP reads:
Sum: 6024888
Avg: 60
Sum, removed outliers: 6024888
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3274866
Avg: 32
Sum, removed outliers: 3232913
Avg, removed outliers: 35
======================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 36217404
Avg: 362
Sum, removed outliers: 36097164
Avg, removed outliers: 361
Mfenced rtdsc reads:
Sum: 22973400
Avg: 229
Sum, removed outliers: 22939236
Avg, removed outliers: 229
Lfenced rtdscP reads:
Sum: 13178196
Avg: 131
Sum, removed outliers: 13177872
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 12631932
Avg: 126
Sum, removed outliers: 12631932
Avg, removed outliers: 126
NOT fenced rtdscP reads:
Sum: 12115548
Avg: 121
Sum, removed outliers: 12103236
Avg, removed outliers: 121
NOT fenced rtdsc reads:
Sum: 3335997
Avg: 33
Sum, removed outliers: 3305333
Avg, removed outliers: 35
=================== END OF AMD RYZEN EXPERIMENTS =========================
And here is the bulldozer architecture experiments.
======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;
Mfenced rtdscP reads:
Sum: 32120355
Avg: 321
Sum, removed outliers: 27718117
Avg, removed outliers: 278
Mfenced rtdsc reads:
Sum: 23739715
Avg: 237
Sum, removed outliers: 23013028
Avg, removed outliers: 230
Lfenced rtdscP reads:
Sum: 14274916
Avg: 142
Sum, removed outliers: 13026199
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 11083963
Avg: 110
Sum, removed outliers: 10905271
Avg, removed outliers: 109
NOT fenced rtdscP reads:
Sum: 9361738
Avg: 93
Sum, removed outliers: 8993886
Avg, removed outliers: 90
NOT fenced rtdsc reads:
Sum: 4766349
Avg: 47
Sum, removed outliers: 4310312
Avg, removed outliers: 43
=================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 38748536
Avg: 387
Sum, removed outliers: 36719312
Avg, removed outliers: 368
Mfenced rtdsc reads:
Sum: 35106459
Avg: 351
Sum, removed outliers: 33514331
Avg, removed outliers: 335
Lfenced rtdscP reads:
Sum: 23867349
Avg: 238
Sum, removed outliers: 23203849
Avg, removed outliers: 232
Lfenced rtdsc reads:
Sum: 21991975
Avg: 219
Sum, removed outliers: 21394828
Avg, removed outliers: 215
NOT fenced rtdscP reads:
Sum: 19790942
Avg: 197
Sum, removed outliers: 19701909
Avg, removed outliers: 197
NOT fenced rtdsc reads:
Sum: 10841074
Avg: 108
Sum, removed outliers: 10583085
Avg, removed outliers: 106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================
Intel results are:
======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration
Using a *=3
Mfenced rtdscP reads:
Sum: 10914893
Avg: 109
Sum, removed outliers: 10820879
Avg, removed outliers: 108
Mfenced rtdsc reads:
Sum: 7866322
Avg: 78
Sum, removed outliers: 7606613
Avg, removed outliers: 76
Lfenced rtdscP reads:
Sum: 4823705
Avg: 48
Sum, removed outliers: 4783842
Avg, removed outliers: 47
Lfenced rtdsc reads:
Sum: 3634106
Avg: 36
Sum, removed outliers: 3463079
Avg, removed outliers: 34
NOT fenced rtdscP reads:
Sum: 2216884
Avg: 22
Sum, removed outliers: 1435830
Avg, removed outliers: 17
NOT fenced rtdsc reads:
Sum: 1736640
Avg: 17
Sum, removed outliers: 986250
Avg, removed outliers: 12
===================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 22008705
Avg: 220
Sum, removed outliers: 16097871
Avg, removed outliers: 177
Mfenced rtdsc reads:
Sum: 13086713
Avg: 130
Sum, removed outliers: 12627094
Avg, removed outliers: 126
Lfenced rtdscP reads:
Sum: 9882409
Avg: 98
Sum, removed outliers: 9753927
Avg, removed outliers: 97
Lfenced rtdsc reads:
Sum: 8854943
Avg: 88
Sum, removed outliers: 8435847
Avg, removed outliers: 84
NOT fenced rtdscP reads:
Sum: 7302577
Avg: 73
Sum, removed outliers: 7190424
Avg, removed outliers: 71
NOT fenced rtdsc reads:
Sum: 1726126
Avg: 17
Sum, removed outliers: 1029630
Avg, removed outliers: 12
=================== END OF INTEL EXPERIMENTS =========================
From my point of view, AMD Ryzen should've executed faster. My Intel CPU is almost 5 years old and the AMD CPU is brand new.
I couldn't find the exact source, but I have read that AMD changed/decreased the resolution of the rdtsc and rdtscp instructions when updating the architecture from Bulldozer to Ryzen. That is why I get multiples of 36 when I try to time code. I don't know why they did it, or where I found that information, but it is the case. If you have an AMD Ryzen machine, I would suggest you run the experiments and look at the timer outputs.
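A quick granularity check (my code): read the TSC back-to-back many times and print the smallest nonzero delta.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t min_delta = UINT64_MAX;
    for (int i = 0; i < 1000000; i++) {
        uint64_t t0 = rdtsc();
        uint64_t delta = rdtsc() - t0;
        if (delta != 0 && delta < min_delta)
            min_delta = delta;
    }
    // On my Ryzen this prints a multiple of 36; on machines whose TSC
    // advances every reference clock it is much smaller.
    printf("min nonzero rdtsc delta: %llu\n",
           (unsigned long long)min_delta);
}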
I haven't looked at rdpmc yet; I'll try to update when I have read up on it.
EDIT:
Following up to the comments below.
About warming up: all experiments are in a single C program, so even if things are not warmed up during the mfenced rdtscp run (the first experiment), they surely are warmed up for the later ones.
I am using C mixed with inline assembly. I just use gcc main.c -o main to compile the code; AFAIK that compiles with -O0 optimization. gcc is version 7.4.0.
Also, to decrease overhead, I wrote my helpers as #define macros so that there are no function calls, which means faster execution.
An example of how I did the experiments:
#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// Reading the low 32 bits is enough for these measurements because the
// measured intervals are short. For longer measurements, I would need
// to shift RDX left by 32 and OR it in.
#define rdtscp(_readval) asm volatile("rdtscp\n" : "=a"(_readval) :: "rcx", "rdx");

void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
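For completeness, the scaffolding this snippet assumes looks roughly like this (read1/read2 are globals, matching the disassembly below; calculate_sum_avg is my reconstruction of what the printed output implies):

#include <stdio.h>
#include <stdint.h>

#define MEASUREMENT_ITERATION 100000

// Globals, so at -O0 every access goes through memory, matching the
// mov instructions around rdtscp in the disassembly.
uint64_t read1, read2;

void calculate_sum_avg(uint64_t* scores) {
    uint64_t sum = 0;
    for (int i = 0; i < MEASUREMENT_ITERATION; i++)
        sum += scores[i];
    printf("Sum: %llu\nAvg: %llu\n",
           (unsigned long long)sum,
           (unsigned long long)(sum / MEASUREMENT_ITERATION));
    // The outlier-removed figures in the results come from a second
    // pass that this sketch omits.
}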
EDIT2:
Why are you using mfence?
I wasn't using mfence in the first place. I was just doing rdtscp, the work, then rdtscp again to find the difference.
No idea what you're hoping to learn here by cycle-accurate timing of anti-optimized gcc -O0 output.
I am not using any optimization because I want to measure how many cycles the instructions take to finish. I will measure code blocks that include branches; if I enabled optimization, the compiler might change a branch into a conditional move (cmov), and that would defeat the whole point of the measurement.
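(For example, a min-style branch like the one below: at -O0 gcc keeps the compare-and-jump, while optimized builds typically turn it into a cmov. My illustration, not code from the experiments.)

// At -O0 this stays a real branch (compare + conditional jump);
// optimizing builds usually emit a cmov instead, which would hide
// exactly the branchy behavior I want to time.
unsigned long clamp(unsigned long a, unsigned long b) {
    if (a > b)
        a = b;
    return a;
}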
I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.
Also, below is the disassembled version of the code. During the measurement there is no memory access (except for read1 and read2, which I believe are in the cache) and no call to other functions.
9fd: 0f ae f0 mfence
a00: 0f 01 f9 rdtscp
a03: 48 89 05 36 16 20 00 mov %rax,0x201636(%rip) # 202040 <read1>
a0a: 0f ae f0 mfence
a0d: 8b 05 15 16 20 00 mov 0x201615(%rip),%eax # 202028 <a21>
a13: 83 c0 03 add $0x3,%eax #Either this or division operations for measurement
a16: 89 05 0c 16 20 00 mov %eax,0x20160c(%rip) # 202028 <a21>
a1c: 0f ae f0 mfence
a1f: 0f 01 f9 rdtscp
a22: 48 89 05 0f 16 20 00 mov %rax,0x20160f(%rip) # 202038 <read2>
a29: 0f ae f0 mfence
a2c: 48 8b 15 05 16 20 00 mov 0x201605(%rip),%rdx # 202038 <read2>
a33: 48 8b 05 06 16 20 00 mov 0x201606(%rip),%rax # 202040 <read1>
a3a: 48 29 c2 sub %rax,%rdx
a3d: 8b 85 ec ca f3 ff mov -0xc3514(%rbp),%eax
Answer 3:
The code:
register unsigned long a21 asm("r13");

#define calculation_to_measure() {\
    a21 += 3;\
}

#define initvars() {\
    read1 = 0;\
    read2 = 0;\
    a21 = 21;\
}
// =========== RDTSCP, double mfence ================
// Reference code; the others are similar.
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
Results: I only ran these on the AMD Ryzen machine.
Using gcc main.c -O0 -o rdtsc, no optimization. It moves r13 to rax.
Disassembled code:
9ac: 0f ae f0 mfence
9af: 0f 01 f9 rdtscp
9b2: 48 89 05 7f 16 20 00 mov %rax,0x20167f(%rip) # 202038 <read1>
9b9: 0f ae f0 mfence
9bc: 4c 89 e8 mov %r13,%rax
9bf: 48 83 c0 03 add $0x3,%rax
9c3: 49 89 c5 mov %rax,%r13
9c6: 0f ae f0 mfence
9c9: 0f 01 f9 rdtscp
9cc: 48 89 05 5d 16 20 00 mov %rax,0x20165d(%rip) # 202030 <read2>
9d3: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 32846796
Avg: 328
Sum, removed outliers: 32626008
Avg, removed outliers: 327
Mfenced rtdsc reads:
Sum: 18235980
Avg: 182
Sum, removed outliers: 18108180
Avg, removed outliers: 181
Lfenced rtdscP reads:
Sum: 14351508
Avg: 143
Sum, removed outliers: 14238432
Avg, removed outliers: 142
Lfenced rtdsc reads:
Sum: 11179368
Avg: 111
Sum, removed outliers: 10994400
Avg, removed outliers: 115
NOT fenced rtdscP reads:
Sum: 6064488
Avg: 60
Sum, removed outliers: 6064488
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3306394
Avg: 33
Sum, removed outliers: 3278450
Avg, removed outliers: 35
Using gcc main.c -Og -o rdtsc_global
Disassembled code:
934: 0f ae f0 mfence
937: 0f 01 f9 rdtscp
93a: 48 89 05 f7 16 20 00 mov %rax,0x2016f7(%rip) # 202038 <read1>
941: 0f ae f0 mfence
944: 49 83 c5 03 add $0x3,%r13
948: 0f ae f0 mfence
94b: 0f 01 f9 rdtscp
94e: 48 89 05 db 16 20 00 mov %rax,0x2016db(%rip) # 202030 <read2>
955: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 22819428
Avg: 228
Sum, removed outliers: 22796064
Avg, removed outliers: 227
Mfenced rtdsc reads:
Sum: 20630736
Avg: 206
Sum, removed outliers: 19937664
Avg, removed outliers: 199
Lfenced rtdscP reads:
Sum: 13375008
Avg: 133
Sum, removed outliers: 13374144
Avg, removed outliers: 133
Lfenced rtdsc reads:
Sum: 9840312
Avg: 98
Sum, removed outliers: 9774036
Avg, removed outliers: 97
NOT fenced rtdscP reads:
Sum: 8784684
Avg: 87
Sum, removed outliers: 8779932
Avg, removed outliers: 87
NOT fenced rtdsc reads:
Sum: 3274209
Avg: 32
Sum, removed outliers: 3255480
Avg, removed outliers: 36
Using -O1 optimization: gcc main.c -O1 -o rdtsc_o1
Disassembled code:
a89: 0f ae f0 mfence
a8c: 0f 31 rdtsc
a8e: 48 89 05 a3 15 20 00 mov %rax,0x2015a3(%rip) # 202038 <read1>
a95: 0f ae f0 mfence
a98: 49 83 c5 03 add $0x3,%r13
a9c: 0f ae f0 mfence
a9f: 0f 31 rdtsc
aa1: 48 89 05 88 15 20 00 mov %rax,0x201588(%rip) # 202030 <read2>
aa8: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 28041804
Avg: 280
Sum, removed outliers: 27724464
Avg, removed outliers: 277
Mfenced rtdsc reads:
Sum: 17936460
Avg: 179
Sum, removed outliers: 17931024
Avg, removed outliers: 179
Lfenced rtdscP reads:
Sum: 7110144
Avg: 71
Sum, removed outliers: 7110144
Avg, removed outliers: 71
Lfenced rtdsc reads:
Sum: 6691140
Avg: 66
Sum, removed outliers: 6672924
Avg, removed outliers: 66
NOT fenced rtdscP reads:
Sum: 5970888
Avg: 59
Sum, removed outliers: 5965236
Avg, removed outliers: 59
NOT fenced rtdsc reads:
Sum: 3402920
Avg: 34
Sum, removed outliers: 3280111
Avg, removed outliers: 35
Source: https://stackoverflow.com/questions/58802323/one-thread-counting-other-thread-does-a-job-and-measurement