I got this program from this link (https://gist.github.com/jiewmeng/3787223).I have been searching the web with the idea of gaining a better understanding of processor cache
Your lengthMod variable doesn't do what you think it does. You want it to limit the size of your data set, but you have 2 problems there -
lengthMod is 1k (0x400), then all indices lower than 0x400 (meaning i=1 to 63) would simply map to index 0, so you'll always hit the cache. That's probably why the results are so fast. Instead use lengthMod - 1 to create a correct mask (0x400 --> 0x3ff, which would mask just the upper bits and leave the lower ones intact).lengthMod are not a power of 2, so doing the lengthMod-1 isn't going to work there as some of the mask bits would still be zeros. Either remove them from the list, or use a modulo operation instead of lengthMod-1 altogether. See also my answer here for a similar case.Another issue is that 16B jumps are probably not enough to skip a cachline as most common CPUs work with 64 byte cachelines, so you get only one miss for every 4 iterations. Use (i*64) instead.