Why are elementwise additions much faster in separate loops than in a combined loop?

旧巷少年郎 2020-11-22 09:49

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const i         


        
10 answers
  •  刺人心 (original poster)
    2020-11-22 10:04

    OK, the right answer definitely has something to do with the CPU cache. But the cache argument can be quite difficult to make, especially without data.

    There are many answers that led to a lot of discussion, but let's face it: cache issues can be very complex and are not one-dimensional. They depend heavily on the size of the data, so my question was unfair: it turned out to sit at a very interesting point in the cache graph.

    @Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.

    That's why I combined his test (contiguous vs. separate allocation) with the advice from @James' answer.
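    The two allocation strategies compared here can be sketched roughly as follows. This is only an illustration: the struct `FourArrays` and the helper names `allocate_separate` and `allocate_contiguous` are hypothetical, and the element type `double` is an assumption. `std::make_unique<double[]>(n)` zero-initializes the memory, so this corresponds to the "initialized data" case.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical holder for the four arrays a1, b1, c1, d1.
struct FourArrays {
    std::vector<std::unique_ptr<double[]>> storage;  // owns the heap memory
    double* a1 = nullptr;
    double* b1 = nullptr;
    double* c1 = nullptr;
    double* d1 = nullptr;
};

// Separate allocation: four independent heap blocks of n doubles each.
FourArrays allocate_separate(std::size_t n) {
    assert(n > 0);
    FourArrays f;
    for (int i = 0; i < 4; ++i) {
        f.storage.push_back(std::make_unique<double[]>(n));  // zero-initialized
    }
    f.a1 = f.storage[0].get();
    f.b1 = f.storage[1].get();
    f.c1 = f.storage[2].get();
    f.d1 = f.storage[3].get();
    return f;
}

// Contiguous allocation: one block of 4*n doubles, sliced into four arrays.
FourArrays allocate_contiguous(std::size_t n) {
    assert(n > 0);
    FourArrays f;
    f.storage.push_back(std::make_unique<double[]>(4 * n));  // zero-initialized
    double* base = f.storage[0].get();
    f.a1 = base;
    f.b1 = base + n;
    f.c1 = base + 2 * n;
    f.d1 = base + 3 * n;
    return f;
}
```

    With separate blocks, the relative addresses of the four arrays are up to the allocator. With one block, the arrays sit exactly n elements apart, which is the property the contiguous test exercises.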

    The graphs below show that most of the answers, and especially the majority of the comments on the question and answers, can be considered completely wrong or completely true depending on the exact scenario and parameters used.

    Note that my initial question was at n = 100,000. This point (by accident) exhibits special behavior:

    1. It has the greatest discrepancy between the one-loop and two-loop versions (almost a factor of three).

    2. It is the only point where the one-loop version (namely with contiguous allocation) beats the two-loop version. (This is what made Mysticial's answer possible at all.)
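    For reference, the one-loop and two-loop versions being compared might look like the sketch below. The loop bodies are an assumption based on the question title (elementwise additions into a1 and c1), since the code in the question is cut off here; the function names `add_one_loop` and `add_two_loops` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// One-loop (combined) version: both additions happen in a single pass,
// so four arrays are streamed through the cache at the same time.
void add_one_loop(double* a1, const double* b1,
                  double* c1, const double* d1, std::size_t n) {
    assert(a1 && b1 && c1 && d1);
    for (std::size_t j = 0; j < n; ++j) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
}

// Two-loop (separate) version: each pass touches only two arrays.
void add_two_loops(double* a1, const double* b1,
                   double* c1, const double* d1, std::size_t n) {
    assert(a1 && b1 && c1 && d1);
    for (std::size_t j = 0; j < n; ++j) {
        a1[j] += b1[j];
    }
    for (std::size_t j = 0; j < n; ++j) {
        c1[j] += d1[j];
    }
}
```

    Both functions compute the same result; only the memory access pattern differs, which is why the comparison is so sensitive to n and to how the arrays are laid out.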

    The result using initialized data:

    [Graph: results with initialized data; image not available]

    The result using uninitialized data (this is what Mysticial tested):

    [Graph: results with uninitialized data; image not available]

    And this one is hard to explain: initialized data that is allocated once and reused for every following test case of a different vector size:

    [Graph: results with initialized data allocated once and reused; image not available]

    Proposal

    Every low-level, performance-related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache-relevant data sizes! It's a waste of everybody's time to think of answers, and especially to discuss them with others, without this information.
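    A minimal sketch of what such a measurement could look like is given below. It is not the benchmark used for the graphs above: the helper names `mflops`, `measure_mflops`, and `sweep`, the doubling step, and the combined-loop kernel are all assumptions made for illustration. A serious benchmark would also need warm-up runs, several repetitions per size, and a check that the compiler has not optimized the loop away.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// MFLOPS = floating-point operations / (seconds * 1e6).
double mflops(double flop_count, double seconds) {
    assert(seconds > 0.0);
    return flop_count / seconds / 1e6;
}

// Times `repeats` passes of a combined loop over arrays of length n.
// Each element costs 2 additions per pass. Returns 0 if the clock
// could not resolve the elapsed time.
double measure_mflops(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<double> a1(n, 1.0), b1(n, 2.0), c1(n, 3.0), d1(n, 4.0);

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t j = 0; j < n; ++j) {
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
    }
    auto stop = std::chrono::steady_clock::now();

    // Read a result so the work is observable.
    volatile double sink = a1[n / 2] + c1[n / 2];
    (void)sink;

    double seconds = std::chrono::duration<double>(stop - start).count();
    if (seconds <= 0.0) {
        return 0.0;
    }
    return mflops(2.0 * static_cast<double>(n) * repeats, seconds);
}

// Sweeps n from min_n to max_n by doubling, keeping the total number of
// processed elements roughly constant. Returns (n, MFLOPS) pairs.
std::vector<std::pair<std::size_t, double>> sweep(std::size_t min_n,
                                                  std::size_t max_n,
                                                  std::size_t total_elements) {
    assert(min_n > 0 && min_n <= max_n);
    std::vector<std::pair<std::size_t, double>> results;
    for (std::size_t n = min_n; n <= max_n; n *= 2) {
        int repeats = static_cast<int>(
            std::max<std::size_t>(1, total_elements / n));
        results.emplace_back(n, measure_mflops(n, repeats));
    }
    return results;
}
```

    Plotting the returned pairs with n on a logarithmic axis gives one curve of the kind discussed above; repeating the sweep for each loop variant and each allocation strategy gives the full picture.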
