cpu-cache

Linked lists, arrays, and hardware memory caches

元气小坏坏 submitted on 2019-12-04 05:17:13
While questions have been asked before about linked lists versus arrays, the answers mostly boil down to what most of us probably learned at some point: lists are good at insertion and deletion; arrays are good at random access. Now respectable people like Bjarne Stroustrup have argued that arrays practically always outperform linked lists because they make much better use of the caching architecture implemented in modern hardware. He also states that the performance advantage of arrays increases with their size. While I basically understand his arguments and agree with him, I …

Cache specifications for intel core i7

非 Y 不嫁゛ submitted on 2019-12-04 03:51:05
I am building a cache simulator for an Intel Core i7 but am having a hard time finding the detailed specifications for the L1, L2, and L3 (shared) caches. I need the cache block size, cache size, associativity, and so on. Can anyone point me in the right direction? Intel's optimization guide describes most of the required specifications per architectural generation (you didn't specify which i7 you have; there are now several generations, from Nehalem up to Haswell). Haswell, for example, would have: … Note that if you're building a simulator, you'll want to have as many of these features as possible …

Cache miss rate of array

人盡茶涼 submitted on 2019-12-03 21:22:50
I'm trying to figure out how to calculate the miss rate of an array. I have the answer, but I don't understand how it was arrived at. I have the following code:

    int C[N1][N2];
    int A[N1][N3];
    int B[N3][N2];

    initialize_arrays(A, B, C, N1, N2, N3);

    for (i = 0; i < N1; ++i)
        for (j = 0; j < N2; ++j)
            for (k = 0; k < N3; ++k)
                C[i][j] += A[i][k] * B[k][j];

I also have the following info: N1 = N2 = N3 = 2048 (what does this mean??). The processor has an L1 data cache of 32 kB with a line size of 64 B (no L2 cache). (What is line size?) I know the miss rate of array C is (N^2/16)/N^3. I know the formula is (total …

how to read L2 cache hit/miss rate in Android (ARM)?

一笑奈何 submitted on 2019-12-03 21:14:26
I found a way to read the L1 (data and instruction) cache counters using http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4237.html . I want to read the L2 performance counters too. Does anyone know how to measure the L2 cache hit rate, possibly with ARM assembly or at a higher level such as Java? Accessing performance data for the L2 depends on the L2 controller. I don't know how many different ones there are, but on current Cortex-A9 platforms the PL310 is pretty common and features event counters which can capture requests and hits (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc …

clflush not flushing the instruction cache

a 夏天 submitted on 2019-12-03 15:15:02
Question: Consider the following code segment:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[0]))

    inline void clflush(volatile void *p) {
        asm volatile ("clflush (%0)" :: "r"(p));
    }

    inline uint64_t rdtsc() {
        unsigned long a, d;
        asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");
        return a | ((uint64_t)d << 32);
    }

    inline int func() { return 5; }

    inline void test() {
        uint64_t start, end;
        char c;
        start = rdtsc();
        func();
        end = rdtsc …

When L1 misses are a lot different than L2 accesses… TLB related?

帅比萌擦擦* submitted on 2019-12-03 13:58:56
I have been running benchmarks on some algorithms and profiling their memory usage and efficiency (L1/L2/TLB accesses and misses), and some of the results are quite intriguing to me. Considering an inclusive cache hierarchy (L1 and L2 caches), shouldn't the number of L1 cache misses coincide with the number of L2 cache accesses? One explanation I can find would be TLB related: when a virtual address is not mapped in the TLB, the system automatically skips searches in some cache levels. Does this seem legitimate? First, inclusive cache hierarchies may not be as common as you assume. For …

Invalidating the CPU's cache

家住魔仙堡 submitted on 2019-12-03 13:43:34
When my program performs a load operation with acquire semantics, a store operation with release semantics, or perhaps a full fence, it invalidates the CPU's cache. My question is this: which part of the cache is actually invalidated? Only the cache line that held the variable I used acquire/release on? Or perhaps the entire cache is invalidated (L1 + L2 + L3, and so on)? Is there a difference in this respect when I use acquire/release semantics versus a full fence? I'm not an expert on this, but I stumbled on this document; maybe it's helpful: http://www.rdrop.com/users/paulmck …

Write a program to get CPU cache sizes and levels

让人想犯罪 __ submitted on 2019-12-03 12:38:39
Question: I want to write a program to get my cache sizes (L1, L2, L3). I know the general idea of it: allocate a big array, then access parts of it of a different size each time. So I wrote a little program. Here's my code:

    #include <cstdio>
    #include <time.h>
    #include <sys/mman.h>

    const int KB = 1024;
    const int MB = 1024 * KB;
    const int data_size = 32 * MB;
    const int repeats = 64 * MB;
    const int steps = 8 * MB;
    const int times = 8;

    long long clock_time() {
        struct timespec tp;
        clock_gettime(CLOCK_REALTIME, &tp);
        …

How does CLFLUSH work for an address that is not in cache yet?

南笙酒味 submitted on 2019-12-03 11:48:53
Question: We are trying to use the Intel CLFLUSH instruction to flush the cache content of a process in Linux from userspace. We created a very simple C program that first accesses a large array and then calls CLFLUSH to flush the virtual address range of the whole array. We measure the latency it takes CLFLUSH to flush the whole array. The size of the array is an input to the program, and we vary it from 1 MB to 40 MB in steps of 2 MB. In our understanding, CLFLUSH should flush the …

How would you generically detect cache line associativity from user mode code?

一世执手 submitted on 2019-12-03 11:21:47
Question: I'm putting together a small patch for the cachegrind/callgrind tool in valgrind which will auto-detect, using completely generic code, the CPU instruction and cache configuration (right now only x86/x64 auto-configures; other architectures don't provide CPUID-type configuration to non-privileged code). This code will need to execute entirely in a non-privileged context, i.e. pure user-mode code. It also needs to be portable across very different POSIX implementations, so grokking /proc …