For-loop efficiency: merging loops

野趣味 2021-01-04 19:46

I have always had the idea that reducing the number of iterations is the way to make programs more efficient. Since I never really confirmed that, I set out to test it.

7 Answers
  •  醉酒成梦
    2021-01-04 20:25

    Your assumptions are basically flawed:

    1. Loop iteration does not incur significant cost.

      This is what CPUs are optimized for: tight loops. CPU optimizations can go as far as dedicated circuitry for the loop counter (the PPC's bdnz instruction, for example), so that the loop counter overhead is exactly zero. x86 needs a CPU cycle or two for it, as far as I know, but that's it.

    2. What kills your performance is generally memory accesses.

      Fetching a value from the L1 cache already has a latency of three to four CPU cycles. A single load from L1 cache has more latency than your entire loop control! Higher-level caches take longer still, and RAM access takes forever.

    So, to get good performance, you generally need to reduce the time spent accessing memory. That can be done in several ways:

    • Avoiding memory accesses.

      Most effective, and most easily forgotten optimization. You do not pay for what you don't do.

    • Parallelizing memory accesses.

      Avoid loads where the address of the next needed value depends on the value you just loaded (pointer chasing). This optimization is tricky, as it needs a clear understanding of the dependencies between the different memory accesses.

      This optimization may require some loop fusion or loop unrolling to exploit the independence between the different loop bodies/iterations. In your case, the loop iterations are independent of each other, so they are already as parallel as can be.

      Also, as MSalters rightly points out in the comments: the CPU has a limited number of registers. How many depends on the architecture; a 32-bit x86 CPU, for instance, has only eight general-purpose registers. Thus, it simply cannot handle ten different pointers at the same time. It will need to spill some of the pointers to the stack, introducing even more memory accesses, which obviously violates the point above about avoiding memory accesses.

    • Sequentializing memory accesses.

      CPUs are built with the knowledge that the vast majority of memory accesses are sequential, and they are optimized for this. When you start accessing an array, the CPU will generally notice pretty quickly and start prefetching the subsequent values.
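    The "parallelizing" point above, the cost of dependent loads, can be sketched with two loops that do the same work. This is a hypothetical illustration; the names and structure are not from the original question:

```c
#include <assert.h>
#include <stddef.h>

struct node { int value; struct node *next; };

/* Dependent accesses: the address of each load comes out of the previous
 * load, so the CPU must wait the full load latency for every element
 * ("pointer chasing"). */
int sum_list(const struct node *head) {
    int acc = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        acc += p->value;
    return acc;
}

/* Independent accesses: the address data + i is computable without any
 * prior load, so the CPU can keep several loads in flight at once. */
int sum_array(const int *data, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}
```

    Both functions compute the same sum; the array version merely exposes more memory-level parallelism to the hardware.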

    The last point is where your first function fails: you are jumping back and forth between accessing 10 different arrays at 10 totally different memory locations. This reduces the CPU's ability to deduce which cache lines it should prefetch from main memory, and thus reduces overall performance.
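    Since the question's actual code is not shown here, a minimal sketch of the two loop shapes being compared might look like this. Both variants increment every element of ten arrays; the function and parameter names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ARRAYS 10

/* Merged: each iteration touches ten unrelated memory regions, which
 * strains the prefetcher; also, ten base pointers do not fit into the
 * eight registers of a 32-bit x86 CPU, forcing spills to the stack. */
void process_merged(int *arrays[NUM_ARRAYS], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (int a = 0; a < NUM_ARRAYS; a++)
            arrays[a][i] += 1;
}

/* Split: each inner loop streams through a single array sequentially,
 * a pattern the hardware prefetcher detects easily. */
void process_split(int *arrays[NUM_ARRAYS], size_t n) {
    for (int a = 0; a < NUM_ARRAYS; a++)
        for (size_t i = 0; i < n; i++)
            arrays[a][i] += 1;
}
```

    Despite executing more loop-control instructions, the split version tends to win on large arrays because each loop is a pure sequential stream, which is exactly what the answer above predicts.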
