Question
I would like to implement a two-thread model where one thread counts (infinitely increments a value) and the other records the first counter value, does the job, records the second counter value, and measures the time elapsed in between.
Here is what I have done so far:
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;

void* counter_thread(){
    // affinity is set to some isolated CPU so the noise will be minimal
    while(1){
        //counter++;                                   // Line 1*
        asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
    }
}

void* measurement_thread(){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1, r2;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a *= 3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2 - r1;
        printf("counter: %lu\n", counter);
        break;
    }
}
Let me explain what I have done so far:
Since I want the counter to be accurate, I am setting the affinity to an isolated CPU. Also, if I use the plain increment in Line 1*, the disassembled function is:
d4c: 4c 89 e8 mov %r13,%rax
d4f: 48 83 c0 01 add $0x1,%rax
d53: 49 89 c5 mov %rax,%r13
d56: eb f4 jmp d4c <counter_thread+0x37>
This is not a 1-cycle operation, which is why I used inline assembly to get rid of the two mov instructions. Using the inline assembly:
d4c: 49 83 c5 01 add $0x1,%r13
d50: eb fa jmp d4c <counter_thread+0x37>
But the thing is, neither implementation works: the other thread cannot see the counter being updated. If I make the global counter a normal variable rather than a register, it works, but I lose the precision I want. If I declare the global counter as unsigned long counter, then the disassembled code of the counter thread is:
d4c: 48 8b 05 ed 12 20 00 mov 0x2012ed(%rip),%rax # 202040 <counter>
d53: 48 83 c0 01 add $0x1,%rax
d57: 48 89 05 e2 12 20 00 mov %rax,0x2012e2(%rip) # 202040 <counter>
d5e: eb ec jmp d4c <counter_thread+0x37>
It works but it doesn't give me the granularity that I want.
EDIT:
My environment:
- CPU: AMD Ryzen 3600
- kernel: 5.0.0-32-generic
- OS: Ubuntu 18.04
EDIT2: I have isolated 2 neighboring CPU cores (i.e. cores 10 and 11) and I am running the experiment on those cores. The counter runs on one core, the measurement on the other. Isolation is done by adding an isolcpus line to /etc/default/grub.
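For reference, the grub entry looks something like this (core numbers from my setup; run sudo update-grub and reboot afterwards):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=10,11"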
EDIT3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.
Experiment 1: Setup:
unsigned long counter = 0; // global counter

void* counter_thread(){
    mfence();
    while(1)
        counter++;
}

void* measurement_thread(){
    unsigned long i = 0, r1 = 0, r2 = 0;
    unsigned int a = 0;
    sleep(1);
    while(1){
        mfence();
        r1 = counter;
        a += 3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
}
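(measurements and MILLION_ITER are not shown here; assume declarations like these:)

#define MILLION_ITER 10000000UL
// Histogram of observed r2-r1 deltas. Sized generously; a stray
// context switch can make the delta arbitrarily large, so real code
// should clamp the index before incrementing.
unsigned long measurements[1 << 20];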
Results 1: In 99.99% of the iterations I got 0, which I expect, because either the first thread is not running, or the OS or other interrupts disturb the measurement. Discarding the 0s and the very high values gives me about 20 cycles per measurement on average. (I was expecting 3-4, because I only do an integer addition.)
Experiment2:
Setup: Identical to the one above, with one difference: instead of a plain global counter, I declare the counter as a register variable:
register unsigned long counter asm("r13");
Results 2: The measurement thread always reads 0. In the disassembled code I can see that both threads work on the R13 register (the counter); however, the register is apparently not shared between them.
Experiment3:
Setup: Identical to setup 2, except that in the counter thread, instead of doing counter++, I use inline assembly to make sure the increment is a single instruction. My disassembled file looks like this:
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
Results 3: The measurement thread reads 0, as above.
Answer 1:
Each thread has its own registers. Each logical CPU core has its own architectural registers, which a thread uses while running on that core; only signal handlers (or, on bare metal, interrupts) can asynchronously modify a thread's own registers.
Declaring a GNU C register-global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.
Only memory is shared between threads, not registers. This is how multiple threads can run at the same time without stepping on each other, each using its own registers.
Registers you don't declare as register-globals can be used freely by the compiler, so it wouldn't work at all for them to be shared between cores. (And there's nothing GCC can do to make registers shared vs. private depending on how you declare them.)
Even apart from that, the register-global isn't volatile or atomic, so r1 = counter; and r2 = counter; can CSE, making r2-r1 a compile-time-constant zero even if your local R13 were changing from a signal handler.
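If you do want a software counter that another thread can actually observe, it has to live in memory. A minimal sketch using a C11 relaxed atomic (my illustration, not the question's code):

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;   // shared: lives in memory, not a register

void* counter_thread(void* arg) {
    (void)arg;
    for (;;)   // relaxed is enough: we only need the value, not ordering
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

static inline uint64_t read_counter(void) {
    return atomic_load_explicit(&counter, memory_order_relaxed);
}

Each read from the other thread still costs an inter-core cache transfer, though, which is exactly the granularity problem discussed below.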
How can I make sure that both of the threads are using registers for read/write operation of the counter value?
You can't do that. There is no shared state between cores that can be read/written with lower latency than cache.
If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to count core clock cycles).
Using another thread to increment a counter is unnecessary, and not helpful because there's no very-low-overhead way to read something from another core.
The rdtscp instruction on my machine gives 36-72-108... cycle resolution at best. So I cannot distinguish between 2 cycles and 35 cycles, because both will read as 36 cycles.
Then you're using rdtsc wrong. It's not serializing, so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
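A minimal sketch of that fenced pattern (GNU C on x86-64; details are in the linked answer):

#include <stdint.h>

// lfence before: rdtsc can't execute until earlier instructions finish.
// lfence after: later instructions can't start until rdtsc has executed.
// (On AMD, lfence is only dispatch-serializing when the kernel sets the
// relevant MSR bit, which Spectre mitigations normally do.)
static inline uint64_t rdtsc_fenced(void) {
    uint32_t lo, hi;
    asm volatile("lfence\n\trdtsc\n\tlfence"
                 : "=a"(lo), "=d"(hi) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}

Note the result is in reference cycles, not core clock cycles, so it doesn't directly count instructions.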
But more importantly, you can't usefully measure a *=3; in C as a single cost in cycles. First of all, it can compile differently depending on context. But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/
See my answer on RDTSCP in NASM always returns the same value for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency (sketched after the list below), and look at perf counters to get uops->ports. Or look at Agner Fog's instruction tables and https://uops.info/, because people have already done those tests.
Also
- How many CPU cycles are needed for each assembly instruction?
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
- Modern x86 cost model
Again, these are how you time a single asm instruction, not a C statement. With optimization enabled the cost of a C statement can depend on how it optimizes into the surrounding code. (And/or whether latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs.)
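For instance, to separate latency from throughput you put the instruction in a loop two different ways and time each loop (a sketch, my code, using imul as the sample instruction):

#include <stdint.h>

enum { ITERS = 100000000 };

// Latency: every imul consumes the previous result, so the loop runs at
// one result per latency (3 cycles on most modern x86).
uint64_t imul_latency(void) {
    uint64_t x = 1;
    for (long i = 0; i < ITERS; i++)
        asm volatile("imul $3, %0, %0" : "+r"(x));
    return x;
}

// Throughput: four independent dependency chains can overlap, so cycles
// per imul approaches the reciprocal throughput instead.
uint64_t imul_throughput(void) {
    uint64_t a = 1, b = 1, c = 1, d = 1;
    for (long i = 0; i < ITERS; i++)
        asm volatile("imul $3, %0, %0\n\timul $3, %1, %1\n\t"
                     "imul $3, %2, %2\n\timul $3, %3, %3"
                     : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    return a + b + c + d;
}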
Answer 2:
Then you're using rdtsc wrong. It's not serializing so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
Ok. I did my homework.
First things first: I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; there is a P at the end.
I have checked both Intel and AMD manuals for that.
- Intel manual, page 83, Table 2-3: Summary of System Instructions
- AMD manual, pages 403-406
Correct me if I am wrong, but from what I read, I understand that I don't need fence instructions before and after rdtscp, because it is a serializing instruction, right?
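(For concreteness, a fenced rdtscp read looks like this; the manuals do say that while rdtscp waits for all earlier instructions to execute, later instructions may begin before it reads the counter, so a trailing fence is not obviously redundant:)

#include <stdint.h>

// rdtscp waits for earlier instructions to execute before reading the
// TSC, but later instructions may start before the read; the trailing
// lfence keeps the tail of the timed region from leaking past it.
static inline uint64_t rdtscp_fenced(void) {
    uint32_t lo, hi;
    asm volatile("rdtscp\n\tlfence"
                 : "=a"(lo), "=d"(hi) :: "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}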
Second thing is, I ran some experiments on 3 of my machines. Here are the results.
Ryzen experiments
======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Note that almost all sums are divisible by 36, which is my machine's timer resolution. I also checked the cases where the sums are not divisible by 36: that only happens when I don't use fence instructions with rdtsc. There the read value turns out to be 35 or 1, which I believe means rdtsc cannot read the value correctly.
Mfenced rtdscP reads:
Sum: 25884432
Avg: 258
Sum, removed outliers: 25800120
Avg, removed outliers: 258
Mfenced rtdsc reads:
Sum: 17579196
Avg: 175
Sum, removed outliers: 17577684
Avg, removed outliers: 175
Lfenced rtdscP reads:
Sum: 7511688
Avg: 75
Sum, removed outliers: 7501608
Avg, removed outliers: 75
Lfenced rtdsc reads:
Sum: 7024428
Avg: 70
Sum, removed outliers: 7015248
Avg, removed outliers: 70
NOT fenced rtdscP reads:
Sum: 6024888
Avg: 60
Sum, removed outliers: 6024888
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3274866
Avg: 32
Sum, removed outliers: 3232913
Avg, removed outliers: 35
======================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 36217404
Avg: 362
Sum, removed outliers: 36097164
Avg, removed outliers: 361
Mfenced rtdsc reads:
Sum: 22973400
Avg: 229
Sum, removed outliers: 22939236
Avg, removed outliers: 229
Lfenced rtdscP reads:
Sum: 13178196
Avg: 131
Sum, removed outliers: 13177872
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 12631932
Avg: 126
Sum, removed outliers: 12631932
Avg, removed outliers: 126
NOT fenced rtdscP reads:
Sum: 12115548
Avg: 121
Sum, removed outliers: 12103236
Avg, removed outliers: 121
NOT fenced rtdsc reads:
Sum: 3335997
Avg: 33
Sum, removed outliers: 3305333
Avg, removed outliers: 35
=================== END OF AMD RYZEN EXPERIMENTS =========================
And here is the bulldozer architecture experiments.
======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;
Mfenced rtdscP reads:
Sum: 32120355
Avg: 321
Sum, removed outliers: 27718117
Avg, removed outliers: 278
Mfenced rtdsc reads:
Sum: 23739715
Avg: 237
Sum, removed outliers: 23013028
Avg, removed outliers: 230
Lfenced rtdscP reads:
Sum: 14274916
Avg: 142
Sum, removed outliers: 13026199
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 11083963
Avg: 110
Sum, removed outliers: 10905271
Avg, removed outliers: 109
NOT fenced rtdscP reads:
Sum: 9361738
Avg: 93
Sum, removed outliers: 8993886
Avg, removed outliers: 90
NOT fenced rtdsc reads:
Sum: 4766349
Avg: 47
Sum, removed outliers: 4310312
Avg, removed outliers: 43
=================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 38748536
Avg: 387
Sum, removed outliers: 36719312
Avg, removed outliers: 368
Mfenced rtdsc reads:
Sum: 35106459
Avg: 351
Sum, removed outliers: 33514331
Avg, removed outliers: 335
Lfenced rtdscP reads:
Sum: 23867349
Avg: 238
Sum, removed outliers: 23203849
Avg, removed outliers: 232
Lfenced rtdsc reads:
Sum: 21991975
Avg: 219
Sum, removed outliers: 21394828
Avg, removed outliers: 215
NOT fenced rtdscP reads:
Sum: 19790942
Avg: 197
Sum, removed outliers: 19701909
Avg, removed outliers: 197
NOT fenced rtdsc reads:
Sum: 10841074
Avg: 108
Sum, removed outliers: 10583085
Avg, removed outliers: 106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================
Intel results are:
======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration
Using a *=3
Mfenced rtdscP reads:
Sum: 10914893
Avg: 109
Sum, removed outliers: 10820879
Avg, removed outliers: 108
Mfenced rtdsc reads:
Sum: 7866322
Avg: 78
Sum, removed outliers: 7606613
Avg, removed outliers: 76
Lfenced rtdscP reads:
Sum: 4823705
Avg: 48
Sum, removed outliers: 4783842
Avg, removed outliers: 47
Lfenced rtdsc reads:
Sum: 3634106
Avg: 36
Sum, removed outliers: 3463079
Avg, removed outliers: 34
NOT fenced rtdscP reads:
Sum: 2216884
Avg: 22
Sum, removed outliers: 1435830
Avg, removed outliers: 17
NOT fenced rtdsc reads:
Sum: 1736640
Avg: 17
Sum, removed outliers: 986250
Avg, removed outliers: 12
===================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 22008705
Avg: 220
Sum, removed outliers: 16097871
Avg, removed outliers: 177
Mfenced rtdsc reads:
Sum: 13086713
Avg: 130
Sum, removed outliers: 12627094
Avg, removed outliers: 126
Lfenced rtdscP reads:
Sum: 9882409
Avg: 98
Sum, removed outliers: 9753927
Avg, removed outliers: 97
Lfenced rtdsc reads:
Sum: 8854943
Avg: 88
Sum, removed outliers: 8435847
Avg, removed outliers: 84
NOT fenced rtdscP reads:
Sum: 7302577
Avg: 73
Sum, removed outliers: 7190424
Avg, removed outliers: 71
NOT fenced rtdsc reads:
Sum: 1726126
Avg: 17
Sum, removed outliers: 1029630
Avg, removed outliers: 12
=================== END OF INTEL EXPERIMENTS =========================
From my point of view, AMD Ryzen should've executed faster. My Intel CPU is almost 5 years old and the AMD CPU is brand new.
I couldn't find the exact source, but I have read that AMD changed/decreased the resolution of the rdtsc and rdtscp instructions when updating the architecture from Bulldozer to Ryzen. That is why I get multiples of 36 when I try to time code. I don't know why they did it, or where I found that information, but it is the case. If you have an AMD Ryzen machine, I would suggest you run the experiments and look at the timer outputs.
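A quick granularity check (my code): read the TSC back-to-back many times and print the smallest nonzero delta.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t min_delta = UINT64_MAX;
    for (int i = 0; i < 1000000; i++) {
        uint64_t t0 = rdtsc();
        uint64_t delta = rdtsc() - t0;
        if (delta != 0 && delta < min_delta)
            min_delta = delta;
    }
    // On my Ryzen this prints a multiple of 36; on machines whose TSC
    // advances every reference clock it is much smaller.
    printf("min nonzero rdtsc delta: %llu\n",
           (unsigned long long)min_delta);
}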
I haven't looked at rdpmc yet; I'll try to update when I have read up on it.
EDIT:
Following up to the comments below.
About warming up: all experiments are in a single C program, so even if things are not warmed up during the mfenced rdtscp run (the first experiment), they surely are warmed up for the later ones.
I am using C mixed with inline assembly. I just use gcc main.c -o main to compile the code; AFAIK that compiles with -O0 optimization. gcc is version 7.4.0.
Also, to decrease overhead, I wrote my helpers as #define macros so that there are no function calls, which means faster execution.
An example of how I did the experiments:
#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// Reading the low 32 bits is enough for these measurements because the
// measured intervals are short. For longer measurements, I would need
// to shift RDX left by 32 and OR it in.
#define rdtscp(_readval) asm volatile("rdtscp\n" : "=a"(_readval) :: "rcx", "rdx");

void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
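For completeness, the scaffolding this snippet assumes looks roughly like this (read1/read2 are globals, matching the disassembly below; calculate_sum_avg is my reconstruction of what the printed output implies):

#include <stdio.h>
#include <stdint.h>

#define MEASUREMENT_ITERATION 100000

// Globals, so at -O0 every access goes through memory, matching the
// mov instructions around rdtscp in the disassembly.
uint64_t read1, read2;

void calculate_sum_avg(uint64_t* scores) {
    uint64_t sum = 0;
    for (int i = 0; i < MEASUREMENT_ITERATION; i++)
        sum += scores[i];
    printf("Sum: %llu\nAvg: %llu\n",
           (unsigned long long)sum,
           (unsigned long long)(sum / MEASUREMENT_ITERATION));
    // The outlier-removed figures in the results come from a second
    // pass that this sketch omits.
}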
EDIT2:
Why are you using mfence?
I wasn't using mfence in the first place. I was just doing rdtscp, the work, then rdtscp again to find the difference.
No idea what you're hoping to learn here by cycle-accurate timing of anti-optimized gcc -O0 output.
I am not using any optimization because I want to measure how many cycles the instructions take to finish. I will measure code blocks that include branches; if I enabled optimization, the compiler might change a branch into a conditional move (cmov), and that would defeat the whole point of the measurement.
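(For example, a min-style branch like the one below: at -O0 gcc keeps the compare-and-jump, while optimized builds typically turn it into a cmov. My illustration, not code from the experiments.)

// At -O0 this stays a real branch (compare + conditional jump);
// optimizing builds usually emit a cmov instead, which would hide
// exactly the branchy behavior I want to time.
unsigned long clamp(unsigned long a, unsigned long b) {
    if (a > b)
        a = b;
    return a;
}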
I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.
Also, below is the disassembled version of the code. During the measurement there is no memory access (except for read1 and read2, which I believe are in the cache) and no call to other functions.
9fd: 0f ae f0 mfence
a00: 0f 01 f9 rdtscp
a03: 48 89 05 36 16 20 00 mov %rax,0x201636(%rip) # 202040 <read1>
a0a: 0f ae f0 mfence
a0d: 8b 05 15 16 20 00 mov 0x201615(%rip),%eax # 202028 <a21>
a13: 83 c0 03 add $0x3,%eax #Either this or division operations for measurement
a16: 89 05 0c 16 20 00 mov %eax,0x20160c(%rip) # 202028 <a21>
a1c: 0f ae f0 mfence
a1f: 0f 01 f9 rdtscp
a22: 48 89 05 0f 16 20 00 mov %rax,0x20160f(%rip) # 202038 <read2>
a29: 0f ae f0 mfence
a2c: 48 8b 15 05 16 20 00 mov 0x201605(%rip),%rdx # 202038 <read2>
a33: 48 8b 05 06 16 20 00 mov 0x201606(%rip),%rax # 202040 <read1>
a3a: 48 29 c2 sub %rax,%rdx
a3d: 8b 85 ec ca f3 ff mov -0xc3514(%rbp),%eax
Answer 3:
The code:
register unsigned long a21 asm("r13");

#define calculation_to_measure() {\
    a21 += 3;\
}

#define initvars() {\
    read1 = 0;\
    read2 = 0;\
    a21 = 21;\
}
// =========== RDTSCP, double mfence ================
// Reference code; the others are similar.
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
Results: I only ran these on the AMD Ryzen machine.
Using gcc main.c -O0 -o rdtsc, no optimization. It moves r13 to rax.
Disassembled code:
9ac: 0f ae f0 mfence
9af: 0f 01 f9 rdtscp
9b2: 48 89 05 7f 16 20 00 mov %rax,0x20167f(%rip) # 202038 <read1>
9b9: 0f ae f0 mfence
9bc: 4c 89 e8 mov %r13,%rax
9bf: 48 83 c0 03 add $0x3,%rax
9c3: 49 89 c5 mov %rax,%r13
9c6: 0f ae f0 mfence
9c9: 0f 01 f9 rdtscp
9cc: 48 89 05 5d 16 20 00 mov %rax,0x20165d(%rip) # 202030 <read2>
9d3: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 32846796
Avg: 328
Sum, removed outliers: 32626008
Avg, removed outliers: 327
Mfenced rtdsc reads:
Sum: 18235980
Avg: 182
Sum, removed outliers: 18108180
Avg, removed outliers: 181
Lfenced rtdscP reads:
Sum: 14351508
Avg: 143
Sum, removed outliers: 14238432
Avg, removed outliers: 142
Lfenced rtdsc reads:
Sum: 11179368
Avg: 111
Sum, removed outliers: 10994400
Avg, removed outliers: 115
NOT fenced rtdscP reads:
Sum: 6064488
Avg: 60
Sum, removed outliers: 6064488
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3306394
Avg: 33
Sum, removed outliers: 3278450
Avg, removed outliers: 35
Using gcc main.c -Og -o rdtsc_global
Disassembled code:
934: 0f ae f0 mfence
937: 0f 01 f9 rdtscp
93a: 48 89 05 f7 16 20 00 mov %rax,0x2016f7(%rip) # 202038 <read1>
941: 0f ae f0 mfence
944: 49 83 c5 03 add $0x3,%r13
948: 0f ae f0 mfence
94b: 0f 01 f9 rdtscp
94e: 48 89 05 db 16 20 00 mov %rax,0x2016db(%rip) # 202030 <read2>
955: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 22819428
Avg: 228
Sum, removed outliers: 22796064
Avg, removed outliers: 227
Mfenced rtdsc reads:
Sum: 20630736
Avg: 206
Sum, removed outliers: 19937664
Avg, removed outliers: 199
Lfenced rtdscP reads:
Sum: 13375008
Avg: 133
Sum, removed outliers: 13374144
Avg, removed outliers: 133
Lfenced rtdsc reads:
Sum: 9840312
Avg: 98
Sum, removed outliers: 9774036
Avg, removed outliers: 97
NOT fenced rtdscP reads:
Sum: 8784684
Avg: 87
Sum, removed outliers: 8779932
Avg, removed outliers: 87
NOT fenced rtdsc reads:
Sum: 3274209
Avg: 32
Sum, removed outliers: 3255480
Avg, removed outliers: 36
Using -O1 optimization: gcc main.c -O1 -o rdtsc_o1
Disassembled code:
a89: 0f ae f0 mfence
a8c: 0f 31 rdtsc
a8e: 48 89 05 a3 15 20 00 mov %rax,0x2015a3(%rip) # 202038 <read1>
a95: 0f ae f0 mfence
a98: 49 83 c5 03 add $0x3,%r13
a9c: 0f ae f0 mfence
a9f: 0f 31 rdtsc
aa1: 48 89 05 88 15 20 00 mov %rax,0x201588(%rip) # 202030 <read2>
aa8: 0f ae f0 mfence
Results:
Mfenced rtdscP reads:
Sum: 28041804
Avg: 280
Sum, removed outliers: 27724464
Avg, removed outliers: 277
Mfenced rtdsc reads:
Sum: 17936460
Avg: 179
Sum, removed outliers: 17931024
Avg, removed outliers: 179
Lfenced rtdscP reads:
Sum: 7110144
Avg: 71
Sum, removed outliers: 7110144
Avg, removed outliers: 71
Lfenced rtdsc reads:
Sum: 6691140
Avg: 66
Sum, removed outliers: 6672924
Avg, removed outliers: 66
NOT fenced rtdscP reads:
Sum: 5970888
Avg: 59
Sum, removed outliers: 5965236
Avg, removed outliers: 59
NOT fenced rtdsc reads:
Sum: 3402920
Avg: 34
Sum, removed outliers: 3280111
Avg, removed outliers: 35
Source: https://stackoverflow.com/questions/58802323/one-thread-counting-other-thread-does-a-job-and-measurement