cpu-cache

According to Intel my cache should be 24-way associative, though it's 12-way; how is that?

允我心安 submitted on 2019-12-05 12:55:52
According to the “Intel 64 and IA-32 Architectures Optimization Reference Manual,” April 2012, page 2-23: "The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks, from the software point of view this does not appear as a normal N-way cache." My computer is a 2-core Sandy Bridge with a 3 MB, 12-way set-associative LLC. …

How to read L2 cache hit/miss rate in Android (ARM)?

北城余情 submitted on 2019-12-05 08:31:02
Question: I found a way to read the L1 (data and instruction) cache counters using http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4237.html. I want to read the L2 performance counters too. Does anyone know how to measure the L2 cache hit rate, possibly with ARM assembly or at a higher level such as Java? Answer 1: Accessing performance data for the L2 depends on the L2 controller. I don't know how many different ones there are, but on current A9 platforms the PL310 is pretty common and features event …

How is an LRU cache implemented in a CPU?

不羁的心 submitted on 2019-12-04 22:59:29
Question: I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick? Also, what happens in a dual-core system where both cores write to the same address simultaneously? Answer 1: For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.

Invalidating the CPU's cache

血红的双手。 submitted on 2019-12-04 22:47:27
Question: When my program performs a load with acquire semantics, a store with release semantics, or perhaps a full fence, it invalidates the CPU's cache. My question is this: which part of the cache is actually invalidated? Only the cache line that held the variable I used acquire/release on, or the entire cache (L1 + L2 + L3, and so on)? Is there a difference between using acquire/release semantics and using a full fence? Answer 1: I'm not …

Scenarios in which manual software prefetch instructions are reasonable

人走茶凉 submitted on 2019-12-04 16:13:02
I have read that on x86 and x86-64, Intel gcc provides special prefetch instructions: #include <xmmintrin.h> enum _mm_hint { _MM_HINT_T0 = 3, _MM_HINT_T1 = 2, _MM_HINT_T2 = 1, _MM_HINT_NTA = 0 }; void _mm_prefetch(void *p, enum _mm_hint h); Programs can use the _mm_prefetch intrinsic on any pointer in the program, and the different hints to be used with it are implementation-defined. Generally, each hint has its own meaning: _MM_HINT_T0 fetches data to all levels of the cache for inclusive caches, and to the lowest-level cache for exclusive …

Is there a cheaper serializing instruction than cpuid?

十年热恋 submitted on 2019-12-04 15:35:37
I have seen related questions, including here and here, but it seems that the only instruction ever mentioned for serializing rdtsc is cpuid. Unfortunately, cpuid takes roughly 1000 cycles on my system, so I am wondering if anyone knows of a cheaper serializing instruction (fewer cycles and no read or write to memory)? I looked at iret, but that changes control flow, which is also undesirable. I did look at the whitepaper linked in Alex's answer about rdtscp, but it says: "The RDTSCP instruction waits until all previous instructions have been executed before reading …"

CPU cache line and prefetch policy

强颜欢笑 submitted on 2019-12-04 13:02:37
I read this article: http://igoro.com/archive/gallery-of-processor-cache-effects/ . The article says that because of cache-line granularity, this code: int[] arr = new int[64 * 1024 * 1024]; // Loop 1 for (int i = 0; i < arr.Length; i++) arr[i] *= 3; // Loop 2 for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3; should take almost the same execution time for both loops, and I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with 64-bit Ubuntu, on an ARMv6-compatible processor rev 7 with Debian, and also on a Core 2 T6600. None of the results match what the article said. My code is as follows: long int jobTime …

Can't sample hardware cache events with linux perf

二次信任 submitted on 2019-12-04 12:03:59
For some reason, I can't sample ( perf record ) hardware cache events: # perf record -e L1-dcache-stores -a -c 100 -- sleep 5 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.607 MB perf.data (~26517 samples) ] # perf script but I can count them ( perf stat ): # perf stat -e L1-dcache-stores -a -- sleep 5 Performance counter stats for 'sleep 5': 711,781 L1-dcache-stores 5.000842990 seconds time elapsed I tried different CPUs, OS versions (and kernel versions), and perf versions, but the result is the same. Is this expected behaviour? What is the reason? Can …

How to produce the cpu cache effect in C and java?

一个人想着一个人 submitted on 2019-12-04 10:51:44
Question: In part 3 (CPU Caches) of Ulrich Drepper's paper "What Every Programmer Should Know About Memory," he shows a graph of the relationship between working-set size and the CPU cycles consumed per operation (in this case, sequential reading). There are two jumps in the graph which indicate the sizes of the L1 and L2 caches. I wrote my own program to reproduce the effect in C. It simply reads an int array sequentially from head to tail, and I've tried different sizes of the …

Cache-friendly copying of an array with readjustment by known indexes (gather, scatter)

回眸只為那壹抹淺笑 submitted on 2019-12-04 07:54:28
Question: Suppose we have an array of data and another array of indexes. data = [1, 2, 3, 4, 5, 7] index = [5, 1, 4, 0, 2, 3] We want to create a new array from the elements of data placed at the positions given by index. The result should be [4, 2, 5, 7, 3, 1]. The naive algorithm works in O(N), but it performs random memory accesses. Can you suggest a CPU-cache-friendly algorithm with the same complexity? PS: In my particular case, all elements in the data array are integers. PPS: The arrays might contain millions of elements. PPPS: I'm ok with …