Why is the execution time of this function call changing?

日久生厌
日久生厌 2021-02-03 17:10

Preface

This issue seems to affect only Chrome/V8, and may not be reproducible in Firefox or other browsers. In summary, the execution time of a function call increases permanently, by a large factor, once a second, different callback is introduced at the same call site.

2 Answers
  •  不要未来只要你来
    2021-02-03 17:35

    Since this is getting so much interest (and updates to the question), I thought I'd provide some additional detail.

    The new simplified test case is great: it's very simple, and very clearly shows a problem.

    function test(callback) {
      let start = performance.now();
      for (let i = 0; i < 1e6; i++) callback();
      console.log(`${callback.name} took ${(performance.now() - start).toFixed(2)}ms`);
    }
    
    var exampleA = (a,b) => 10**10;
    var exampleB = (a,b) => 10**10;
    
    // one callback -> fast
    for (let i = 0; i < 10; i++) test(exampleA);
    
    // introduce a second callback -> much slower forever
    for (let i = 0; i < 10; i++) test(exampleB);
    for (let i = 0; i < 10; i++) test(exampleA);
    

    On my machine, I'm seeing times go as low as 0.23 ms for exampleA alone, and then they go up to 7.3 ms when exampleB comes along, and stay there. Wow, a 30x slowdown! Clearly that's a bug in V8? Why wouldn't the team jump on fixing this?

    Well, the situation is more complicated than it seems at first.

    Firstly, the "slow" case is the normal situation. That's what you should expect to see in most code. It's still pretty fast! You can do a million function calls (plus a million exponentiations, plus a million loop iterations) in just 7 milliseconds! That's only 7 nanoseconds per iteration+call+exponentiation+return!

    Actually, that analysis was a bit simplified. In reality, an operation on two constants like 10**10 will be constant-folded at compile time, so once exampleA and exampleB get optimized, the optimized code for them will return 1e10 immediately, without doing any multiplications. On the flip side, the code here contains a small oversight that causes the engine to have to do more work: exampleA and exampleB take two parameters (a, b), but they're called without any arguments simply as callback(). Bridging this difference between expected and actual number of parameters is fast, but on a test like this that doesn't do much else, it amounts to about 40% of the total time spent. So a more accurate statement would be: it takes about 4 nanoseconds to do a loop iteration plus a function call plus a materialization of a number constant plus a function return, or 7 ns if the engine additionally has to adapt the arguments count of the call.

    So what about the initial results for just exampleA, how can that case be so much faster? Well, that's the lucky situation that hits various optimizations in V8 and can take several shortcuts -- in fact it can take so many shortcuts that it ends up being a misleading microbenchmark: the results it produces don't reflect real situations, and can easily cause an observer to draw incorrect conclusions. The general effect that "always the same callback" is (typically) faster than "several different callbacks" is certainly real, but this test significantly distorts the magnitude of the difference.

    At first, V8 sees that it's always the same function that's getting called, so the optimizing compiler decides to inline the function instead of calling it. That avoids the adaptation of arguments right off the bat. After inlining, the compiler can also see that the result of the exponentiation is never used, so it drops that entirely. The end result is that this test tests an empty loop! See for yourself:

    function test_empty(no_callback) {
      let start = performance.now();
      for (let i = 0; i < 1e6; i++) {}
      console.log(`empty loop took ${(performance.now() - start).toFixed(2)}ms`);
    }
    

    That gives me the same 0.23 ms as calling exampleA. So contrary to what we thought, we didn't measure the time it takes to call and execute exampleA, in reality we measured no calls at all, and no 10**10 exponentiations either. (If you like more direct proof, you can run the original test in d8 or node with --print-opt-code and see the disassembly of the optimized code that V8 generates internally.)

    All that lets us conclude a few things:

    (1) This is not a case of "OMG there's this horrible slowdown that you must be aware of and avoid in your code". The default performance you get when you don't worry about this is great. Sometimes when the stars align you might see even more impressive optimizations, but… to put it lightly: just because you only get presents on a few occasions per year, doesn't mean that all the other non-gift-bearing days are some horrible bug that must be avoided.

    (2) The smaller your test case, the bigger the observed difference between default speed and lucky fast case. If your callbacks are doing actual work that the compiler can't just eliminate, then the difference will be smaller than seen here. If your callbacks are doing more work than a single operation, then the fraction of overall time that's spent on the call itself will be smaller, so replacing the call with inlining will make less of a difference than it does here. If your functions are called with the parameters they need, that will avoid the needless penalization seen here. So while this microbenchmark manages to create the misleading impression that there's a shockingly large 30x difference, in most real applications it will be between maybe 4x in extreme cases and "not even measurable at all" for many other cases.

    (3) Function calls do have a cost. It's great that (for many languages, including JavaScript) we have optimizing compilers that can sometimes avoid them via inlining. If you have a case where you really, really care about every last bit of performance, and your compiler happens to not inline what you think it should be inlining (for whatever reason: because it can't, or because it has internal heuristics that decide not to), then it can give significant benefits to redesign your code a bit -- e.g. you could inline by hand, or otherwise restructure your control flow to avoid millions of calls to tiny functions in your hottest loops. (Don't blindly overdo it though: having too few too big functions isn't great for optimization either. Usually it's best to not worry about this. Organize your code into chunks that make sense, let the engine take care of the rest. I'm only saying that sometimes, when you observe specific problems, you can help the engine do its job better.)

    If you do need to rely on performance-sensitive function calls, then an easy tuning you can do is to make sure that you're calling your functions with exactly as many arguments as they expect -- which is probably often what you would do anyway. Of course optional arguments have their uses as well; like in so many other cases the extra flexibility comes with a (small) performance cost, which is often negligible, but can be taken into consideration when you feel that you have to.

    (4) Observing such performance differences can understandably be surprising and sometimes even frustrating. Unfortunately, the nature of optimizations is such that they can't always be applied: they rely on making simplifying assumptions and not covering every case, otherwise they wouldn't be fast any more. We work very hard to give you reliable, predictable performance, with as many fast cases and as few slow cases as possible, and no steep cliffs between them. But we cannot escape the reality that we can't possibly "just make everything fast". (Which of course isn't to say that there's nothing left to do: every additional year of engineering work brings additional performance gains.) If we wanted to avoid all cases where more-or-less similar code exhibits noticeably different performance, then the only way to accomplish that would be to not do any optimizations at all, and instead leave everything at baseline ("slow") implementations -- and I don't think that would make anyone happy.

    EDIT to add: It seems there are major differences between different CPUs here, which probably explains why previous commenters have reported so wildly differing results. On hardware I can get my hands on, I'm seeing:

    • i7 6600U: 3.3 ms for inlined case, 28 ms for calling
    • i7 3635QM: 2.8 ms for inlined case, 10 ms for calling
    • i7 3635QM, up-to-date microcode: 2.8 ms for inlined case, 26 ms for calling
    • Ryzen 3900X: 2.5 ms for inlined case, 5 ms for calling

    This is all with Chrome 83/84 on Linux; it's very much possible that running on Windows or Mac would yield different results (because CPU/microcode/kernel/sandbox are closely interacting with each other). If you find these hardware differences shocking, read up on "spectre".
