Write a program to get CPU cache sizes and levels

血红的双手。 Posted on 2019-12-03 03:00:40

I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, so almost every access would complete with L1 or L2 latency.

More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

For your benchmark you should either disable them (through the BIOS or any other means), or at least make your steps longer by replacing j*16 (* 4 bytes per int = 64B, one cache line - a classic unit stride for the stream detector) with j*64 (4 cache lines). The reason is that the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code jumps over 2 lines, but becomes mostly useless with longer jumps (3 lines isn't good because of your modulo; you need a divisor of step_size).

Update the question with the new results and we can figure out if there's anything else here.
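As a rough illustration of the longer stride, here is a minimal sketch. The original benchmark is not shown here, so `touch_strided` and its parameters are hypothetical names, and the comments assume 4-byte ints and 64-byte cache lines as stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the original benchmark: increment one int every
 * `stride_ints` elements, wrapping inside a buffer of `len_ints` elements.
 * With 4-byte ints and 64-byte lines, stride_ints = 16 touches every cache
 * line (unit stride), while stride_ints = 64 touches every fourth line. */
void touch_strided(int *array, size_t len_ints, size_t stride_ints,
                   size_t iterations)
{
    /* The mask only wraps correctly when the length is a power of two. */
    assert(len_ints != 0 && (len_ints & (len_ints - 1)) == 0);
    size_t mask = len_ints - 1;

    for (size_t j = 0; j < iterations; j++) {
        array[(j * stride_ints) & mask]++;
    }
}
```

Calling `touch_strided(buf, 1024, 64, 32)` on a zeroed buffer of 1024 ints walks slots 0, 64, 128, ... and wraps around once, so each visited slot ends up at 2 and everything in between stays 0.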


EDIT1: Ok, I ran the fixed code and got -

1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817

It makes much more sense if you ignore a few rows - there is a jump after 32k (the L1 size), but instead of a jump after 256k (the L2 size), we get too good a result for 384k and jump only at 512k. The last jump is at 8M (my LLC size), but the 9216k result is broken again.

This allows us to spot the next error - ANDing with the size mask only makes sense when the size is a power of 2; otherwise you don't wrap around, but instead repeat some of the last addresses again (which ends up in optimistic results since they are still fresh in the cache).

Try replacing the ... & size_mask with % (steps[i]/sizeof(int)) (the parentheses matter, since % and / have the same precedence in C). The modulo is more expensive, but if you want to have these sizes you need it (or alternatively, use a running index that gets zeroed whenever it exceeds the current size).
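To see the masking problem in isolation, the sketch below counts how many distinct slots each wrapping scheme actually visits. It is not the benchmark itself: `count_distinct_slots` and `enum wrap_mode` are hypothetical names, and the sizes are in array elements rather than kilobytes to keep the numbers small.

```c
#include <assert.h>
#include <stdlib.h>

enum wrap_mode { WRAP_MASK, WRAP_MODULO, WRAP_RUNNING };

/* Hypothetical helper: walk `len` steps of `stride` elements through a
 * buffer of `len` elements and return how many distinct slots were hit. */
size_t count_distinct_slots(size_t len, size_t stride, enum wrap_mode mode)
{
    assert(len > 0 && stride > 0 && stride < len);

    unsigned char *seen = calloc(len, 1);
    if (seen == NULL) {
        return 0;
    }

    size_t distinct = 0;
    size_t running = 0;
    for (size_t j = 0; j < len; j++) {
        size_t idx;
        switch (mode) {
        case WRAP_MASK:                 /* only valid for power-of-2 len */
            idx = (j * stride) & (len - 1);
            break;
        case WRAP_MODULO:               /* works for any len */
            idx = (j * stride) % len;
            break;
        default:                        /* running index, zeroed on overflow */
            idx = running;
            running += stride;
            if (running >= len) {
                running = 0;
            }
            break;
        }
        if (!seen[idx]) {
            seen[idx] = 1;
            distinct++;
        }
    }
    free(seen);
    return distinct;
}
```

With `len = 384` and `stride = 16`, the mask 383 has bit 7 cleared, so only 16 distinct slots are ever touched, while the modulo and the running index both reach all 24. With a power-of-2 length such as 512, the mask and the modulo agree.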

I think you'd be better off looking at the CPUID instruction. It's not trivial, but there should be information on the web.
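For what it's worth, here is a minimal sketch of the CPUID route. It assumes GCC or Clang on x86 (for `<cpuid.h>` and `__get_cpuid_count`) and uses leaf 4, which is where Intel processors report cache parameters; AMD processors report cache data through a different leaf, so this may find nothing there. The function names `cache_size_from_leaf4` and `print_caches` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Decode the cache size in bytes from the EBX/ECX values of CPUID leaf 4:
 * size = ways * partitions * line size * sets, each field stored minus 1. */
uint64_t cache_size_from_leaf4(uint32_t ebx, uint32_t ecx)
{
    uint64_t ways       = ((ebx >> 22) & 0x3FF) + 1;
    uint64_t partitions = ((ebx >> 12) & 0x3FF) + 1;
    uint64_t line_size  = (ebx & 0xFFF) + 1;
    uint64_t sets       = (uint64_t)ecx + 1;
    return ways * partitions * line_size * sets;
}

/* Print every cache reported by leaf 4; returns how many were found
 * (0 on non-x86 builds or when the leaf is unavailable). */
int print_caches(FILE *out)
{
    assert(out != NULL);
    int found = 0;
#if defined(__x86_64__) || defined(__i386__)
    static const char *const names[] = { "?", "data", "instruction", "unified" };
    for (unsigned sub = 0; sub < 16; sub++) {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx)) {
            break;                      /* leaf 4 not supported */
        }
        unsigned type = eax & 0x1F;     /* 0 means no more caches */
        if (type == 0) {
            break;
        }
        unsigned level = (eax >> 5) & 0x7;
        fprintf(out, "L%u %s: %llu bytes\n", level,
                type <= 3 ? names[type] : names[0],
                (unsigned long long)cache_size_from_leaf4(ebx, ecx));
        found++;
    }
#endif
    return found;
}
```

Calling `print_caches(stdout)` from `main` lists one line per cache. The decoding is split into its own function so it can be checked by hand: 8 ways, 1 partition, 64-byte lines and 64 sets give 32768 bytes.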

Also, if you're on Windows, you can use the GetLogicalProcessorInformation function. Mind you, it's only present in Windows XP SP3 and above. I know nothing about Linux/Unix.

If you're using GNU/Linux you can just read the contents of /proc/cpuinfo and, for further details, the files under /sys/devices/system/cpu/*. It is common under UNIX not to define an API where a plain file can do the job anyway.

I would also take a look at the source of util-linux; it contains a program named lscpu. This should give you an example of how to retrieve the required information.
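A minimal sketch of the plain-file approach could look like the following. It assumes the usual Linux sysfs layout, /sys/devices/system/cpu/cpu0/cache/indexN/ with `level`, `type` and `size` files, and it looks at cpu0 only; the function names are made up for this example, and it simply reports zero caches when those files are absent.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert a sysfs size string such as "32K" into bytes; -1 if unparsable. */
long parse_cache_size(const char *text)
{
    char *end;
    long value = strtol(text, &end, 10);
    if (end == text || value < 0) {
        return -1;
    }
    if (*end == 'K') {
        return value * 1024L;
    }
    if (*end == 'M') {
        return value * 1024L * 1024L;
    }
    return value;
}

/* Read the first line of a file into buf, without the trailing newline. */
static int read_line(const char *path, char *buf, size_t buflen)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    char *ok = fgets(buf, (int)buflen, f);
    fclose(f);
    if (ok == NULL) {
        return -1;
    }
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the caches the kernel reports for cpu0; returns how many it found. */
int list_cpu0_caches(FILE *out)
{
    assert(out != NULL);
    static const char base[] = "/sys/devices/system/cpu/cpu0/cache";
    int found = 0;
    for (int i = 0; i < 16; i++) {
        char path[128], level[16], type[32], size[32];
        snprintf(path, sizeof path, "%s/index%d/level", base, i);
        if (read_line(path, level, sizeof level) != 0) break;
        snprintf(path, sizeof path, "%s/index%d/type", base, i);
        if (read_line(path, type, sizeof type) != 0) break;
        snprintf(path, sizeof path, "%s/index%d/size", base, i);
        if (read_line(path, size, sizeof size) != 0) break;
        fprintf(out, "L%s %s: %ld bytes\n", level, type, parse_cache_size(size));
        found++;
    }
    return found;
}
```

Calling `list_cpu0_caches(stdout)` from `main` prints one line per cache index. The loop stops at the first index whose files cannot be read, which is how the end of the list shows up.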

// update
http://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/sys-utils/lscpu.c

I've just taken a look at the source there. It basically reads from the files mentioned above, that's all. Therefore it is absolutely valid to read from those files as well; they are provided by the kernel.
