cpu-cache

CPU cacheline and prefetch policy

Submitted by 為{幸葍}努か on 2019-12-06 08:41:15
Question: I read this article: http://igoro.com/archive/gallery-of-processor-cache-effects/. The article says that because of the cache line, the two loops in the code below will take almost the same time to execute:

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with Ubuntu 64-bit, on an ARMv6-compatible processor rev 7 with Debian, and also ran…
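
A minimal C sketch of the kind of stride test described above (not the asker's actual code; the array size and the use of clock_gettime are assumptions made for illustration):

/* Touch every int vs. every 16th int (one per 64-byte line) and time both loops. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    int *arr = malloc((size_t)N * sizeof *arr);
    if (!arr) return 1;
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)        /* Loop 1: every element */
        arr[i] *= 3;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (long i = 0; i < N; i += 16)    /* Loop 2: one element per cache line */
        arr[i] *= 3;
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("loop1: %.1f ms, loop2: %.1f ms\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2));
    free(arr);
    return 0;
}

Because both loops touch every 64-byte cache line of the 256 MB array, memory traffic is the same and the run times end up close, which is the effect the article describes.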

Scenarios when software prefetching manual instructions are reasonable

Submitted by 独自空忆成欢 on 2019-12-06 07:29:16
Question: I have read that on x86 and x86-64, Intel gcc provides special prefetching instructions:

#include <xmmintrin.h>
enum _mm_hint
{
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program, and the different hints to be used with the _mm_prefetch intrinsic are implementation-defined. Generally, each of the hints has its own meaning. _MM_HINT_T0…
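
A hedged sketch of the typical usage pattern: prefetch a few iterations ahead while walking a large array. The prefetch distance PF_DIST is a tuning assumption, not a documented value, and has to be measured per workload:

#include <xmmintrin.h>
#include <stddef.h>

#define PF_DIST 16  /* elements ahead of the current index; tune per workload */

void scale(float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        a[i] *= k;
    }
}

Note that for a simple sequential walk like this the hardware prefetchers usually do the job already; manual prefetching tends to pay off only for irregular access patterns where the next address is computable but not predictable.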

Intel PMU event for L1 cache hit event

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-06 04:34:05
I'm trying to count the number of cache hits at the different levels of cache (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count the number of L2 and L3 cache hits by monitoring the respective events. To achieve that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: which event number and UMASK value should I use to count L1 cache hit events? Clarifications: 1) The final goal I want to achieve…
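
A hedged sketch of counting a raw PMU event with perf_event_open(2). The encoding shown (event 0xD1, umask 0x01, i.e. MEM_LOAD_UOPS_RETIRED.L1_HIT on Haswell) is my reading of the SDM event tables and should be verified against the tables for your exact model:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* glibc provides no wrapper for this syscall */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = 0x01D1;          /* (umask 0x01 << 8) | event 0xD1 */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under measurement ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof count);
    printf("L1 hit events: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}

The same raw config value can also be passed to the perf tool on the command line instead of writing code, if that is more convenient.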

How to calculate effective CPI for a 3 level cache

Submitted by 时间秒杀一切 on 2019-12-06 03:03:27
Question: I am hopelessly stuck on a homework problem, and I would love some help understanding it better. Here is what I was given:

CPU base CPI = 2, clock rate = 2 GHz
Primary cache: miss rate/instruction = 7%
L2 cache: access time = 15 ns, local miss rate/instruction = 30%
L3 cache: access time = 30 ns, global miss rate/instruction = 3%
Main memory access time = 150 ns

What is the effective CPI? It is my understanding that I need to calculate the miss penalty for each cache level.
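
One common way to read these numbers (an assumption about the intended interpretation, since "local" and "global" miss rates can be defined differently in different textbooks): convert each access time to cycles at 2 GHz (0.5 ns per cycle) and weight it by the fraction of instructions that reach that level.

L2 access  = 15 ns  = 30 cycles
L3 access  = 30 ns  = 60 cycles
Memory     = 150 ns = 300 cycles

Effective CPI = 2                        (base)
              + 0.07 * 30        = 2.10  (7% of instructions go to L2)
              + 0.07 * 0.30 * 60 = 1.26  (30% of those continue to L3)
              + 0.03 * 300       = 9.00  (3% of all instructions go to memory)
              ≈ 14.36

Treat this as a sketch of the method rather than the answer key; the key point is that each level's access time, in cycles, is weighted by the miss rate of the levels above it.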

How do cores decide which cache line to invalidate in MESI?

Submitted by ℡╲_俬逩灬. on 2019-12-06 02:54:30
I have some misunderstandings about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens:

mov [addr], dword 0xAC763F   ; executed before starting Thread 1 and Thread 2

Now let's say the threads perform the following actions in parallel:

Thread 1                  Thread 2
mov rax, [addr]           mov rax, [addr]
mov [addr], dword 1       mov [addr], dword 2

My understanding of what's going on is this: before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive. If both Thread 1 and Thread 2 finished reading…

Linked lists, arrays, and hardware memory caches

Submitted by Deadly on 2019-12-06 00:12:30
Question: While questions have been asked before about linked lists versus arrays, the answers mostly boil down to what most of us probably learned at some point: lists are good at inserting and deleting, while arrays are good at random access. Now, respected people like Bjarne Stroustrup have argued that arrays practically always outperform linked lists because they make much better use of the caching architecture implemented in modern hardware. He also states that the performance advantage of…
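
A minimal C sketch contrasting the two traversals the argument is about (node count and layout are illustrative assumptions, not a rigorous benchmark):

#include <stdio.h>
#include <stdlib.h>

struct node { long value; struct node *next; };

#define N 1000000

int main(void)
{
    /* Array: N longs in one contiguous block, so sequential traversal is
       friendly to the hardware prefetchers. */
    long *arr = malloc(N * sizeof *arr);
    long sum = 0;
    for (long i = 0; i < N; i++) arr[i] = i;
    for (long i = 0; i < N; i++) sum += arr[i];

    /* List: N separately allocated nodes; each hop can miss in the cache,
       and the address of the next node is only known after the current
       node's load completes. */
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }
    for (struct node *p = head; p; p = p->next) sum += p->value;

    printf("%ld\n", sum);
    return 0;
}

In practice, consecutively malloc'd nodes often land near each other, so a realistic benchmark also shuffles the link order before timing; this sketch only shows the shape of the two traversals and why the pointer chase serializes the memory accesses.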

Cache specifications for intel core i7

Submitted by 独自空忆成欢 on 2019-12-05 21:57:17
Question: I am building a cache simulator for an Intel Core i7, but I am having a hard time finding the detailed specifications for the L1, L2 and L3 (shared) caches. I need the cache block size, cache size, associativity, and so on. Can anyone point me in the right direction?

Answer 1: Intel's optimization guide describes most of the required specifications per architectural generation (you didn't specify which i7 you have; there are now several generations, from Nehalem up to Haswell). Haswell, for example, would…
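
If you have access to the machine itself, a hedged sketch of querying the parameters programmatically: glibc exposes some cache properties via sysconf(3). The values come from CPUID and may be reported as 0 on some systems, so treat them as a starting point rather than an authoritative spec:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1d size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc: %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    printf("L2 size:   %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:   %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}

On Linux the same information is also available under /sys/devices/system/cpu/cpu0/cache/.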

Why do L1 and L2 Cache waste space saving the same data?

Submitted by 南楼画角 on 2019-12-05 21:29:45
I don't know why the L1 cache and L2 cache save the same data. For example, let's say we want to access Memory[x] for the first time. Memory[x] is mapped into the L2 cache first, then the same piece of data is mapped into the L1 cache, from which the CPU registers can retrieve it. But now we have duplicate data stored in both the L1 and L2 caches; isn't that a problem, or at least a waste of storage space?

I edited your question to ask why CPUs waste cache space storing the same data in multiple levels of cache, because I think that's what you're asking. Not all caches are like that. The cache inclusion policy for an…

Sandy-Bridge CPU specification

Submitted by 喜欢而已 on 2019-12-05 16:43:55
I was able to put together bits here and there about the Sandy Bridge-E architecture, but I am not totally sure about all the parameters, e.g. the size of the L2 cache. Can anyone please confirm that they are all correct? My main source was the 64-ia-32-architectures-optimization-manual.pdf.

On Sandy Bridge, each core has 256 KB of L2 (see the datasheet, section 1.1). For 6 cores, that's 1.5 MB in total, but since each core only accesses its own L2, it's better to always think of it as 256 KB per core. Moreover, the peak GFLOPS figure looks completely wrong: AVX is 16 flops/cycle (as single-precision floats), and with 6 cores, that's…
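
For reference, the usual back-of-the-envelope formula (the clock rate here is an assumed example, not a value from the question):

peak SP GFLOPS = cores × clock (GHz) × flops/cycle/core
               = 6 × 3.2 × 16 ≈ 307 GFLOPS (single precision)

The 16 flops/cycle comes from one 8-wide AVX add plus one 8-wide AVX multiply issued per cycle on Sandy Bridge; double precision is half that.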

Would buffering cache changes prevent Meltdown?

Submitted by 本秂侑毒 on 2019-12-05 14:50:27
If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible? The proposal is to make speculative execution able to load from memory, but not to write to the CPU caches until the loads are actually committed.

TL;DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data out of a physical register), but it would probably be too expensive (in power cost, and maybe also performance) to be a likely implementation. But…
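
For context, a minimal sketch of the flush+reload timing primitive the answer refers to: time one load of a line when it is cached versus right after a clflush. The thresholds and the exact fencing discipline are simplifications for illustration:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static uint64_t time_load(volatile char *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the load being timed */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static char buf[4096];
    volatile char *p = buf;

    (void)*p;                      /* warm the line into the cache */
    uint64_t hit = time_load(p);

    _mm_clflush(buf);              /* evict the line */
    _mm_mfence();
    uint64_t miss = time_load(p);

    printf("cached load: ~%llu cycles, flushed load: ~%llu cycles\n",
           (unsigned long long)hit, (unsigned long long)miss);
    return 0;
}

The large, easily measurable gap between the two timings is exactly the channel that lets an attacker read back which line speculative execution touched, which is why the proposal above focuses on keeping speculative loads out of the cache until retirement.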