Array/Linked list: performance depends on the *direction* of traversal? [closed]

Submitted by 痴心易碎 on 2019-12-03 03:13:58

Talking about PC hardware, early hardware prefetchers (say, circa 2005) were better at detecting and prefetching forward accesses, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is entirely possible that it still implements only basic forward-only prefetching.

Outside of a proper prefetcher, which actually detects access patterns, it is very common for the cache hardware to fetch more than one cache line when a miss occurs. Often this takes the form of simply fetching the next cache line in addition to the required one. This implementation would give the forward direction a big advantage by effectively halving the cache miss rate in that case (assuming the prefetcher itself is ineffective).
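To make the direction effect concrete, here is a minimal sketch (class and method names are my own, and this is not a rigorous benchmark; a real measurement should use a harness like caliper) that traverses the same array forward and backward:

```java
import java.util.Arrays;

// Minimal sketch: time forward vs. backward traversal of one array.
public class TraversalDirection {
    static long sumForward(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];       // ascending addresses
        return s;
    }

    static long sumBackward(int[] a) {
        long s = 0;
        for (int i = a.length - 1; i >= 0; i--) s += a[i];  // descending addresses
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 22];  // 16 MB of ints: larger than typical L3
        Arrays.fill(a, 1);
        long t0 = System.nanoTime();
        long f = sumForward(a);
        long t1 = System.nanoTime();
        long b = sumBackward(a);
        long t2 = System.nanoTime();
        System.out.printf("forward %d us, backward %d us (sums %d, %d)%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, f, b);
    }
}
```

On hardware with forward-only prefetching, `sumBackward` should miss in cache noticeably more often; on recent desktop CPUs the two loops should be close.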

Locally, on a Core i7, I get slightly better results for the linked-list version: ~3.3 ms for the whole iteration vs. 3.5 ms for the array version, when using the original program (which iterates the linked list in reverse order of creation). So I don't see the same effect you did.

The inner loop of your test, which checks the value of val, has a big impact. The current loop will cause a lot of branch mispredictions, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test it was, since I got about 1 ns/iteration for small iteration counts that fit in L1; 1 ns (about 3 cycles) isn't consistent with a full branch misprediction. When I changed the loop to do an unconditional val += msg.value1, the array version got a significant boost, even in the 1,000,000-iteration case (which probably won't even fit in L3).
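The two inner-loop shapes being compared can be sketched like this (class, field, and method names here are my own stand-ins, not the original test's):

```java
// Sketch: a data-dependent branch vs. an unconditional add in the inner loop.
public class InnerLoop {
    // Branchy version: whether the add happens depends on the data, so on
    // unpredictable input the branch predictor fails often -- unless the JIT
    // replaces the branch with a conditional move (CMOV).
    static long conditionalSum(int[] values, int threshold) {
        long val = 0;
        for (int v : values) {
            if (v > threshold) val += v;
        }
        return val;
    }

    // Branchless version: no data-dependent control flow, so the loop
    // pipelines freely regardless of the input values.
    static long unconditionalSum(int[] values) {
        long val = 0;
        for (int v : values) val += v;
        return val;
    }
}
```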

Interestingly enough, the same transformation (val += msg.value1) made the linked-list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (those fitting in L2), while the two approaches were comparable at larger sizes. From caliper:

  length method         ns linear runtime
     100  ARRAY       63.7 =
     100 LINKED      190.1 =
    1000  ARRAY      725.7 =
    1000 LINKED     1788.5 =
 1000000  ARRAY  2904083.2 ===
 1000000 LINKED  3043820.4 ===
10000000  ARRAY 23160128.5 ==========================
10000000 LINKED 25748352.0 ==============================

The behavior for small iteration counts is easier to explain - the linked list, which has to use pointer chasing, has a data dependency between each iteration of the loop. That is, each iteration depends on the previous, because the address to load comes from the previous element. The array doesn't have this same data dependency - only the increment of i is dependent, and this is very fast (i is certainly in a register here). So the loop can be much better pipelined in the array case.
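That dependency chain is easy to see in code (the types below are a hypothetical stand-in for the original message list):

```java
// Contrast the serial load chain of a linked list with independent array loads.
public class PointerChasing {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    // Each iteration's load address (n.next) comes out of the previous load:
    // the CPU cannot begin fetching node k+1 until node k has arrived.
    static long sumList(Node head) {
        long s = 0;
        for (Node n = head; n != null; n = n.next) s += n.value;
        return s;
    }

    // The address of a[i] depends only on the register-resident index i, so
    // loads for several iterations can be in flight at once.
    static long sumArray(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }
}
```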

I don't know the answer, but I would start by looking at the size of the generated bytecode. Since the number of iterations is known in the array case (cnt is hardcoded and final), the compiler may have unrolled some iterations, saving the jump and comparison instructions.

Also, if you know the basics of how a program works at the low-level layers, looking at the disassembled bytecode might give you some hints. Even if you are not fluent in assembly language, it is not too hard to understand a simple program like yours (I was surprised at how much I could figure out the first time I saw some disassembled Java code).

Hope this helps.
