How can I prevent the Rust benchmark library from optimizing away my code?

Anonymous (unverified), posted on 2019-12-03 02:31:01

Question:

I have a simple idea I'm trying to benchmark in Rust. However, when I go to measure it using test::Bencher, the base case that I'm trying to compare against:

    #![feature(test)]
    extern crate test;

    #[cfg(test)]
    mod tests {

        use test::black_box;
        use test::Bencher;

        const ITERATIONS: usize = 100_000;

        struct CompoundValue {
            pub a: u64,
            pub b: u64,
            pub c: u64,
            pub d: u64,
            pub e: u64,
        }

        #[bench]
        fn bench_in_place(b: &mut Bencher) {
            let mut compound_value = CompoundValue {
                a: 0,
                b: 2,
                c: 0,
                d: 5,
                e: 0,
            };

            let val: &mut CompoundValue = &mut compound_value;

            let result = b.iter(|| {
                let mut f : u64 = black_box(0);
                for _ in 0..ITERATIONS {
                    f += val.a + val.b + val.c + val.d + val.e;
                }
                f = black_box(f);
                return f;
            });
            assert_eq!((), result);
        }
    }

is optimized away entirely by the compiler, resulting in:

    running 1 test
    test tests::bench_in_place ... bench:           0 ns/iter (+/- 1)

As you can see in the gist, I have tried to employ the suggestions set forth in the documentation, namely:

  • Making use of the test::black_box method to hide implementation details from the compiler.
  • Returning the calculated value from the closure passed to the iter method.

Are there any other tricks I can try?

Answer 1:

The problem here is that the compiler can see that the result of the loop is the same every time iter calls the closure (it just adds some constant to f), because val never changes.

Looking at the assembly (by passing --emit asm to the compiler) demonstrates this:

    _ZN5tests14bench_in_place20h6a2d53fa00d7c649yaaE:
        ; ...
        movq    %rdi, %r14
        leaq    40(%rsp), %rdi
        callq   _ZN3sys4time5inner10SteadyTime3now20had09d1fa7ded8f25mjwE@PLT
        movq    (%r14), %rax
        testq   %rax, %rax
        je  .LBB0_3
        leaq    24(%rsp), %rcx
        movl    $700000, %edx
    .LBB0_2:
        movq    $0, 24(%rsp)
        #APP
        #NO_APP
        movq    24(%rsp), %rsi
        addq    %rdx, %rsi
        movq    %rsi, 24(%rsp)
        #APP
        #NO_APP
        movq    24(%rsp), %rsi
        movq    %rsi, 24(%rsp)
        #APP
        #NO_APP
        decq    %rax
        jne .LBB0_2
    .LBB0_3:
        leaq    24(%rsp), %rbx
        movq    %rbx, %rdi
        callq   _ZN3sys4time5inner10SteadyTime3now20had09d1fa7ded8f25mjwE@PLT
        leaq    8(%rsp), %rdi
        leaq    40(%rsp), %rdx
        movq    %rbx, %rsi
        callq   _ZN3sys4time5inner30_$RF$$u27$a$u20$SteadyTime.Sub3sub20h940fd3596b83a3c25kwE@PLT
        movups  8(%rsp), %xmm0
        movups  %xmm0, 8(%r14)
        addq    $56, %rsp
        popq    %rbx
        popq    %r14
        retq

The section between .LBB0_2: and jne .LBB0_2 is what the call to iter compiles down to: it repeatedly runs the code in the closure that you passed to it. The #APP #NO_APP pairs are the black_box calls. You can see that the iter loop doesn't do much: movq just moves data between registers and the stack, and addq/decq just add to and decrement some integers.

Looking above that loop, there's movl $700000, %edx: this loads the constant 700_000 into the edx register... and, suspiciously, 700000 = ITERATIONS * (0 + 2 + 0 + 5 + 0). (The other stuff in the code isn't so interesting.)
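The arithmetic behind that constant can be checked directly. This is only a minimal sketch: the function name folded_constant is made up for illustration, and ITERATIONS is typed as u64 here (the question uses usize) to avoid a cast.

```rust
// Assumption: u64 instead of the question's usize, purely for convenience.
const ITERATIONS: u64 = 100_000;

/// Hypothetical helper: the value the optimizer can precompute for the
/// whole inner loop, given the field values from the question.
fn folded_constant() -> u64 {
    let (a, b, c, d, e) = (0u64, 2, 0, 5, 0);
    ITERATIONS * (a + b + c + d + e)
}
```

Calling folded_constant() gives the same 700000 that appears as an immediate in the assembly.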

The way to disguise this is to black_box the input, e.g. I might start off with the benchmark written like:

    #[bench]
    fn bench_in_place(b: &mut Bencher) {
        let mut compound_value = CompoundValue {
            a: 0,
            b: 2,
            c: 0,
            d: 5,
            e: 0,
        };

        b.iter(|| {
            let mut f : u64 = 0;
            let val = black_box(&mut compound_value);
            for _ in 0..ITERATIONS {
                f += val.a + val.b + val.c + val.d + val.e;
            }
            f
        });
    }

In particular, val is black_box'd inside the closure, so that the compiler can't precompute the addition and reuse it for each call.

However, this is still optimised to be very fast: 1 ns/iter for me. Checking the assembly again reveals the problem (I've trimmed the assembly down to just the loop that contains the APP/NO_APP pairs, i.e. the calls to iter's closure):

    .LBB0_2:
        movq    %rcx, 56(%rsp)
        #APP
        #NO_APP
        movq    56(%rsp), %rsi
        movq    8(%rsi), %rdi
        addq    (%rsi), %rdi
        addq    16(%rsi), %rdi
        addq    24(%rsi), %rdi
        addq    32(%rsi), %rdi
        imulq   $100000, %rdi, %rsi
        movq    %rsi, 56(%rsp)
        #APP
        #NO_APP
        decq    %rax
        jne .LBB0_2

Now the compiler has seen that val doesn't change over the course of the for loop, so it has correctly transformed the loop into just summing all the elements of val (that's the sequence of 4 addqs) and then multiplying that by ITERATIONS (the imulq).
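To make that transformation concrete, here is a rough sketch of the two forms side by side. The function names looped and reduced are invented for this illustration; the struct and constant mirror the question.

```rust
const ITERATIONS: usize = 100_000;

struct CompoundValue {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
    e: u64,
}

/// The loop as written in the benchmark.
fn looped(val: &CompoundValue) -> u64 {
    let mut f: u64 = 0;
    for _ in 0..ITERATIONS {
        f += val.a + val.b + val.c + val.d + val.e;
    }
    f
}

/// Roughly what the optimizer turns it into: one sum, one multiply.
/// (Illustrative only; assumes values small enough not to overflow.)
fn reduced(val: &CompoundValue) -> u64 {
    (val.a + val.b + val.c + val.d + val.e) * ITERATIONS as u64
}
```

For the field values used in the question both functions return the same number, which is why the compiler is free to replace one with the other.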

To fix this, we can do the same thing: move the black_box deeper, so that the compiler can't reason about the value between different iterations of the loop:

    #[bench]
    fn bench_in_place(b: &mut Bencher) {
        let mut compound_value = CompoundValue {
            a: 0,
            b: 2,
            c: 0,
            d: 5,
            e: 0,
        };

        b.iter(|| {
            let mut f : u64 = 0;
            for _ in 0..ITERATIONS {
                let val = black_box(&mut compound_value);
                f += val.a + val.b + val.c + val.d + val.e;
            }
            f
        });
    }

This version now takes 137,142 ns/iter for me, although the repeated calls to black_box probably cause non-trivial overhead (having to repeatedly write to the stack, and then read it back).

We can look at the asm, just to be sure:

    .LBB0_2:
        movl    $100000, %ebx
        xorl    %edi, %edi
        .align  16, 0x90
    .LBB0_3:
        movq    %rdx, 56(%rsp)
        #APP
        #NO_APP
        movq    56(%rsp), %rax
        addq    (%rax), %rdi
        addq    8(%rax), %rdi
        addq    16(%rax), %rdi
        addq    24(%rax), %rdi
        addq    32(%rax), %rdi
        decq    %rbx
        jne .LBB0_3
        incq    %rcx
        movq    %rdi, 56(%rsp)
        #APP
        #NO_APP
        cmpq    %r8, %rcx
        jne .LBB0_2

Now the call to iter is two loops: the outer loop that calls the closure many times (.LBB0_2: to jne .LBB0_2), and the for loop inside the closure (.LBB0_3: to jne .LBB0_3). The inner loop is indeed doing a call to black_box (APP/NO_APP) followed by 5 additions. The outer loop is setting f to zero (xorl %edi, %edi), running the inner loop, and then black_boxing f (the second APP/NO_APP).

(Benchmarking exactly what you want to benchmark can be tricky!)
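The same per-iteration black_box idea can be tried without the nightly test crate. The sketch below is not the method used in this answer: it swaps in std::hint::black_box (available on stable Rust since 1.66) and times a single call by hand with std::time::Instant. The names sum_in_place and time_once are hypothetical, and one hand-timed call is far noisier than what Bencher reports.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

const ITERATIONS: usize = 100_000;

struct CompoundValue {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
    e: u64,
}

/// Same inner loop as the final benchmark above: the reference goes
/// through black_box on every iteration, so the sum cannot be hoisted
/// out of the loop.
fn sum_in_place(value: &mut CompoundValue) -> u64 {
    let mut f: u64 = 0;
    for _ in 0..ITERATIONS {
        let val = black_box(&mut *value);
        f += val.a + val.b + val.c + val.d + val.e;
    }
    f
}

/// Hypothetical helper: time one call and return the result with the
/// elapsed wall-clock time.
fn time_once(value: &mut CompoundValue) -> (u64, Duration) {
    let start = Instant::now();
    let result = black_box(sum_in_place(value));
    (result, start.elapsed())
}
```

Calling time_once on a value with the fields from the question returns the sum together with a Duration; how long that is depends entirely on the machine and build flags.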



Answer 2:

The problem with your benchmark is that the optimizer knows that your CompoundValue is going to be immutable during the benchmark, so it can strength-reduce the loop and compile it down to a constant value.

The solution is to use test::black_box on the parts of your CompoundValue. Or even better, try to get rid of the loop (unless you want to benchmark loop performance), and let Bencher.iter(..) do its job.
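A minimal sketch of that suggestion, with no inner loop and each field read hidden behind black_box. The function name sum_fields is invented here, and std::hint::black_box is used so the snippet compiles on a stable toolchain; the answer itself refers to test::black_box from the nightly test crate.

```rust
use std::hint::black_box;

struct CompoundValue {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
    e: u64,
}

/// Hypothetical closure body: one pass over the fields, each read wrapped
/// in black_box so the compiler cannot treat the values as known constants.
fn sum_fields(val: &CompoundValue) -> u64 {
    black_box(val.a)
        + black_box(val.b)
        + black_box(val.c)
        + black_box(val.d)
        + black_box(val.e)
}
```

Inside a #[bench] function this would presumably be used as b.iter(|| sum_fields(&compound_value)), leaving the repetition to Bencher.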


