cpu-cache

CPU cacheline and prefetch policy

Submitted by 為{幸葍}努か on 2019-12-06 08:41:15
Question: I read this article: http://igoro.com/archive/gallery-of-processor-cache-effects/. The article says that because of the cache line, the two loops in the code below will take almost the same time to execute:

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with Ubuntu 64-bit, on an ARMv6-compatible processor rev 7 with Debian, and also ran…
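
A minimal C sketch of the kind of stride test described above (not the asker's actual code; the array size and the use of clock_gettime are assumptions made for illustration):

/* Touch every int vs. every 16th int (one per 64-byte line) and time both loops. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    int *arr = malloc((size_t)N * sizeof *arr);
    if (!arr) return 1;
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)        /* Loop 1: every element */
        arr[i] *= 3;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (long i = 0; i < N; i += 16)    /* Loop 2: one element per cache line */
        arr[i] *= 3;
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("loop1: %.1f ms, loop2: %.1f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    free(arr);
    return 0;
}

Because both loops touch every 64-byte cache line of the 256 MB array, memory traffic is the same and the run times end up close, which is the effect the article describes.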

Scenarios when software prefetching manual instructions are reasonable

Submitted by 独自空忆成欢 on 2019-12-06 07:29:16
Question: I have read that on x86 and x86-64, Intel gcc provides special prefetching instructions:

#include <xmmintrin.h>
enum _mm_hint
{
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program, and the different hints to be used with the _mm_prefetch intrinsic are implementation-defined. Generally, each of the hints has its own meaning. _MM_HINT_T0…
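
A hedged sketch of the typical usage pattern: prefetch a few iterations ahead while walking a large array. The prefetch distance PF_DIST is a tuning assumption, not a documented value, and has to be measured per workload:

#include <xmmintrin.h>
#include <stddef.h>

#define PF_DIST 16  /* elements ahead of the current index; tune per workload */

void scale(float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        a[i] *= k;
    }
}

Note that for a simple sequential walk like this the hardware prefetchers usually do the job already; manual prefetching tends to pay off only for irregular access patterns where the next address is computable but not predictable.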

Intel PMU event for L1 cache hit event

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-06 04:34:05
I'm trying to count the number of cache hits at the different levels of cache (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: which event number and UMASK value should I use to count L1 cache hit events? Clarifications: 1) The final goal I want to achieve…
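
A hedged sketch of counting a raw PMU event with perf_event_open(2). The encoding shown (event 0xD1, umask 0x01, i.e. MEM_LOAD_UOPS_RETIRED.L1_HIT on Haswell) is my reading of the SDM event tables and should be verified against the tables for your exact model:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* glibc provides no wrapper for this syscall */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = 0x01D1;          /* (umask 0x01 << 8) | event 0xD1 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof count);
    printf("L1 hit events: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}

The same raw config value can also be passed to the perf tool on the command line instead of writing code, if that is more convenient.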

How to calculate effective CPI for a 3 level cache

Submitted by 时间秒杀一切 on 2019-12-06 03:03:27
Question: I am hopelessly stuck on a homework problem, and I would love some help understanding it better. Here is what I was given:

CPU base CPI = 2, clock rate = 2 GHz
Primary cache: miss rate/instruction = 7%
L2 cache: access time = 15 ns, local miss rate/instruction = 30%
L3 cache: access time = 30 ns, global miss rate/instruction = 3%
Main memory access time = 150 ns

What is the effective CPI? It is my understanding that I need to calculate the miss penalty for each cache level.
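
One common way to read these numbers (an assumption about the intended interpretation, since "local" and "global" miss rates can be defined differently in different textbooks): convert each access time to cycles at 2 GHz (0.5 ns per cycle) and weight it by the fraction of instructions that reach that level.

L2 access  = 15 ns  = 30 cycles
L3 access  = 30 ns  = 60 cycles
Memory     = 150 ns = 300 cycles

Effective CPI = 2                        (base)
              + 0.07 * 30        = 2.10  (7% of instructions go to L2)
              + 0.07 * 0.30 * 60 = 1.26  (30% of those continue to L3)
              + 0.03 * 300       = 9.00  (3% of all instructions go to memory)
              ≈ 14.36

Treat this as a sketch of the method rather than the answer key; the key point is that each level's access time, in cycles, is weighted by the miss rate of the levels above it.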

How do cores decide which cache line to invalidate in MESI?

Submitted by ℡╲_俬逩灬. on 2019-12-06 02:54:30
I have some misunderstandings about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens:

mov [addr], dword 0xAC763F   ; executed before starting Thread 1 and Thread 2

Now let's say the threads perform the following actions in parallel:

Thread 1                  Thread 2
mov rax, [addr]           mov rax, [addr]
mov [addr], dword 1       mov [addr], dword 2

My understanding of what's going on is this: before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive. If both Thread 1 and Thread 2 finished reading…

Linked lists, arrays, and hardware memory caches

Submitted by Deadly on 2019-12-06 00:12:30
Question: While questions have been asked before about linked lists versus arrays, the answers mostly boil down to what most of us probably learned at some point: lists are good at inserting and deleting, while arrays are good at random access. Now, respected people like Bjarne Stroustrup have argued that arrays practically always outperform linked lists because they make much better use of the caching architecture implemented in modern hardware. He also states that the performance advantage of…
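
A minimal C sketch contrasting the two traversals the argument is about (node count and layout are illustrative assumptions, not a rigorous benchmark):

#include <stdio.h>
#include <stdlib.h>

struct node { long value; struct node *next; };

#define N 1000000

int main(void)
{
    /* Array: N longs in one contiguous block, so sequential traversal is
       friendly to the hardware prefetchers. */
    long *arr = malloc(N * sizeof *arr);
    long sum = 0;
    for (long i = 0; i < N; i++) arr[i] = i;
    for (long i = 0; i < N; i++) sum += arr[i];

    /* List: N separately allocated nodes; each hop can miss in the cache,
       and the address of the next node is only known after the current
       node's load completes. */
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    for (struct node *p = head; p; p = p->next) sum += p->value;

    printf("%ld\n", sum);
    return 0;
}

In practice, consecutively malloc'd nodes often land near each other, so a realistic benchmark also shuffles the link order before timing; this sketch only shows the shape of the two traversals and why the pointer chase serializes the memory accesses.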

Cache specifications for intel core i7

Submitted by 独自空忆成欢 on 2019-12-05 21:57:17
Question: I am building a cache simulator for an Intel Core i7, but I am having a hard time finding the detailed specifications for the L1, L2 and L3 (shared) caches. I need the cache block size, cache size, associativity, and so on. Can anyone point me in the right direction?

Answer 1: Intel's optimization guide describes most of the required specifications per architectural generation (you didn't specify which i7 you have; there are now several generations, from Nehalem up to Haswell). Haswell, for example, would…
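
If you have access to the machine itself, a hedged sketch of querying the parameters programmatically: glibc exposes some cache properties via sysconf(3). The values come from CPUID and may be reported as 0 on some systems, so treat them as a starting point rather than an authoritative spec:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1d size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc: %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size:   %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:   %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}

On Linux the same information is also available under /sys/devices/system/cpu/cpu0/cache/.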

Why do L1 and L2 Cache waste space saving the same data?

Submitted by 南楼画角 on 2019-12-05 21:29:45
I don't know why the L1 cache and L2 cache save the same data. For example, let's say we want to access Memory[x] for the first time. Memory[x] is mapped into the L2 cache first, then the same piece of data is mapped into the L1 cache, from which the CPU registers can retrieve it. But now we have duplicate data stored in both the L1 and L2 caches; isn't that a problem, or at least a waste of storage space?

I edited your question to ask why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking. Not all caches are like that. The cache inclusion policy for an…

Sandy-Bridge CPU specification

Submitted by 喜欢而已 on 2019-12-05 16:43:55
I was able to put together bits here and there about the Sandy Bridge-E architecture, but I am not totally sure about all the parameters, e.g. the size of the L2 cache. Can anyone please confirm that they are all correct? My main source was the 64-ia-32-architectures-optimization-manual.pdf.

On Sandy Bridge, each core has 256 KB of L2 (see the datasheet, section 1.1). For 6 cores, that's 1.5 MB in total, but since each core only accesses its own L2, it's better to always think of it as 256 KB per core. Moreover, the peak GFLOPS figure looks completely wrong: AVX is 16 flops/cycle (as single-precision floats), and with 6 cores, that's…
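
For reference, the usual back-of-the-envelope formula (the clock rate here is an assumed example, not a value from the question):

peak SP GFLOPS = cores × clock (GHz) × flops/cycle/core
               = 6 × 3.2 × 16 ≈ 307 GFLOPS (single precision)

The 16 flops/cycle comes from one 8-wide AVX add plus one 8-wide AVX multiply issued per cycle on Sandy Bridge; double precision is half that.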

Would buffering cache changes prevent Meltdown?

Submitted by 本秂侑毒 on 2019-12-05 14:50:27
If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible? The proposal is to make speculative execution able to load from memory, but not to write to the CPU caches until the loads are actually committed.

TL;DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data out of a physical register), but it would probably be too expensive (in power cost, and maybe also performance) to be a likely implementation. But…
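
For context, a minimal sketch of the flush+reload timing primitive the answer refers to: time one load of a line when it is cached versus right after a clflush. The thresholds and the exact fencing discipline are simplifications for illustration:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static uint64_t time_load(volatile char *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the load being timed */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static char buf[4096];
    volatile char *p = buf;

    (void)*p;                      /* warm the line into the cache */
    uint64_t hit = time_load(p);

    _mm_clflush(buf);              /* evict the line */
    _mm_mfence();
    uint64_t miss = time_load(p);

    printf("cached load: ~%llu cycles, flushed load: ~%llu cycles\n",
           (unsigned long long)hit, (unsigned long long)miss);
    return 0;
}

The large, easily measurable gap between the two timings is exactly the channel that lets an attacker read back which line speculative execution touched, which is why the proposal above focuses on keeping speculative loads out of the cache until retirement.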