cpu-cache

What cache invalidation algorithms are used in actual CPU caches?

Submitted by 谁说胖子不能爱 on 2019-12-20 17:59:15
Question: I came across the topic of caching and mapping and cache misses: how cache blocks get replaced, and in what order, when all blocks are already full. There is the least recently used (LRU) algorithm, the FIFO algorithm, the least frequently used (LFU) algorithm, random replacement, ... But which algorithms are used in actual CPU caches? Or can you use all of them, with the operating system deciding which algorithm is best? Edit: Even though I chose an answer, any further information is welcome ;) Answer 1: As hivert said
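For illustration, here is a minimal C++ model of one set of an N-way associative cache with true-LRU replacement, the textbook policy the question names. The structure and names are my own sketch, not from any answer here; real CPUs typically use cheaper approximations such as pseudo-LRU tree bits or random/NRU.

```cpp
#include <cstdint>
#include <vector>

// One set of an N-way associative cache with true-LRU eviction.
struct CacheSet {
    struct Way { uint64_t tag = 0; bool valid = false; uint64_t last_use = 0; };
    std::vector<Way> ways;
    uint64_t tick = 0;

    explicit CacheSet(int associativity) : ways(associativity) {}

    // Returns true on hit; on miss, fills an invalid way if one exists,
    // otherwise evicts the least recently used way.
    bool access(uint64_t tag) {
        ++tick;
        for (auto& w : ways)
            if (w.valid && w.tag == tag) { w.last_use = tick; return true; }
        Way* victim = &ways[0];
        for (auto& w : ways) {
            if (!w.valid) { victim = &w; break; }
            if (w.last_use < victim->last_use) victim = &w;
        }
        victim->tag = tag; victim->valid = true; victim->last_use = tick;
        return false;
    }
};
```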

CPU cache critical stride test giving unexpected results based on access type

Submitted by 旧时模样 on 2019-12-20 10:34:42
Question: Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time learning more about CPU caching, and wrote a small program to verify whether I am getting this whole thing right (most likely not, I'm afraid). I'll first write down the assumptions that underlie my expectations, so you can stop me here if those are wrong. Based on what I've read, in general: an n-way associative cache is divided into s sets, each
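As a reference point for such a test, the critical stride follows directly from the cache parameters. Here is a sketch assuming a 32 KiB, 8-way L1D with 64-byte lines; these are illustrative values, not necessarily the asker's machine.

```cpp
#include <cstdio>

// Critical stride = cache_size / associativity = sets * line_size.
// Addresses that differ by a multiple of this stride map to the same
// set, so walking an array with that stride can thrash a single set.
int main() {
    const int cache_size = 32 * 1024;
    const int ways       = 8;
    const int line_size  = 64;
    const int sets       = cache_size / (ways * line_size);  // 64
    const int critical   = sets * line_size;                 // 4096 bytes
    std::printf("sets=%d, critical stride=%d bytes\n", sets, critical);
}
```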

How to avoid “heap pointer spaghetti” in dynamic graphs?

Submitted by 烂漫一生 on 2019-12-20 09:29:55
Question: The generic problem: suppose you are coding a system that consists of a graph, plus graph rewrite rules that can be activated depending on the configuration of neighboring nodes. That is, you have a dynamic graph that grows/shrinks unpredictably at runtime. If you naively use malloc, new nodes will be allocated at random positions in memory; after enough time, your heap will be pointer spaghetti, giving you terrible cache efficiency. Is there any lightweight, incremental
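One common mitigation, sketched here as an illustration rather than anything proposed in the thread, is to store all nodes and edges in contiguous arrays and link them by indices instead of raw pointers, so allocation locality stays under your control and the whole graph can be compacted by copying the arrays.

```cpp
#include <cstdint>
#include <vector>

// Index-based adjacency storage: nodes and edges live in two flat
// vectors, and "pointers" are 32-bit indices into those vectors.
struct Node {
    int32_t first_edge = -1;  // index into 'edges'; -1 = no edges
    int32_t data = 0;
};
struct Edge {
    int32_t target = -1;      // index into 'nodes'
    int32_t next   = -1;      // next edge of the same source node
};
struct Graph {
    std::vector<Node> nodes;
    std::vector<Edge> edges;

    int32_t add_node() { nodes.push_back({}); return (int32_t)nodes.size() - 1; }
    void add_edge(int32_t from, int32_t to) {
        edges.push_back({to, nodes[from].first_edge});
        nodes[from].first_edge = (int32_t)edges.size() - 1;
    }
};
```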

What are _mm_prefetch() locality hints?

Submitted by 元气小坏坏 on 2019-12-20 09:06:05
Question: The intrinsics guide says only this much about void _mm_prefetch(char const* p, int i): Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i. Could you list the possible values for the int i parameter and explain their meanings? I've found _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, _MM_HINT_NTA, and _MM_HINT_ENTA, but I don't know whether this is an exhaustive list or what they mean. If processor-specific, I would like
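For context, a typical use of the intrinsic looks like the following sketch: prefetch a fixed distance ahead while streaming through an array. _MM_HINT_T0 requests the line into all cache levels; _MM_HINT_NTA hints that the data is non-temporal (read once, don't pollute the caches). The 8-line distance is an arbitrary illustrative choice.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_*
#include <cstddef>

float sum(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // 128 floats = 8 cache lines of 64 bytes ahead of the cursor.
        if (i + 128 < n)
            _mm_prefetch((const char*)&a[i + 128], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```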

Virtually addressed Cache

Submitted by 那年仲夏 on 2019-12-20 04:04:08
Question: Relation between cache size and page size: how do the associativity and page size constrain the cache size in a virtually addressed cache architecture? In particular, I am looking for an example of the following statement: if C ≤ (page_size × associativity), the cache index bits come only from the page offset (which is the same in the virtual address and the physical address). Answer 1: Intel CPUs have used an 8-way associative 32 KiB L1D with 64 B lines for many years, for exactly this reason. Pages are 4K, so the page offset is
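To make the arithmetic concrete, here is the answer's example worked out in a small snippet, with the parameters as stated: 32 KiB, 8-way, 64 B lines, 4 KiB pages.

```cpp
#include <cstdio>

// Check of C <= page_size * associativity for Intel's L1D.
int main() {
    const int C = 32 * 1024, ways = 8, line = 64, page = 4096;
    const int sets = C / (ways * line);  // 64 sets
    // index bits + offset bits = log2(64) + log2(64) = 6 + 6 = 12,
    // which equals log2(4096) = 12, so the set index lies entirely
    // within the page offset and is identical in the virtual and
    // physical address.
    std::printf("C=%d, page*ways=%d, sets=%d\n", C, page * ways, sets);
}
```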

boost lockfree spsc_queue cache memory access

Submitted by 那年仲夏 on 2019-12-19 08:44:08
Question: I need to be extremely concerned with speed/latency in my current multi-threaded project. Cache access is something I'm trying to understand better, and I'm not clear on how lock-free queues (such as boost::lockfree::spsc_queue) access/use memory at the cache level. I've seen queues used where the pointer to a large object that needs to be operated on by the consumer core is pushed into the queue. If the consumer core pops an element from the queue, I presume that means the element (a
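A minimal sketch of the pattern the question describes, using the real boost::lockfree::spsc_queue API (push/pop returning bool) and a hypothetical BigObject type: only the 8-byte pointer travels through the queue's ring buffer, and the object's cache lines move to the consumer core when it dereferences the pointer.

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <cstdio>
#include <thread>

struct BigObject { char payload[4096]; };

// Single-producer/single-consumer queue of pointers, fixed capacity.
boost::lockfree::spsc_queue<BigObject*, boost::lockfree::capacity<1024>> q;

int main() {
    BigObject obj{};
    obj.payload[0] = 7;

    std::thread producer([&] {
        while (!q.push(&obj)) { /* spin until there is room */ }
    });
    std::thread consumer([] {
        BigObject* p = nullptr;
        while (!q.pop(p)) { /* spin until an element arrives */ }
        // The first touch below typically misses in this core's private
        // cache and is served from the producer's cache or L3.
        std::printf("payload[0]=%d\n", p->payload[0]);
    });
    producer.join();
    consumer.join();
}
```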

Does the x86_64 CPU use the same cache lines for communication between 2 processes via shared memory?

Submitted by 做~自己de王妃 on 2019-12-18 05:20:57
Question: As is known, all cache levels L1/L2/L3 on modern x86_64 are virtually indexed, physically tagged, and all cores communicate via the last-level cache (L3) using a cache-coherence protocol, MOESI/MESIF, over QPI/HyperTransport. For example, Sandy Bridge family CPUs have a 4- to 16-way L3 cache and a 4 KB page size; this allows data to be exchanged between concurrent processes running on different cores via shared memory. This is possible because the L3 cache can't contain the same physical
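For concreteness, here is a bare-bones writer-side sketch of such sharing on POSIX, using shm_open/mmap. The name "/demo_shm" is a hypothetical illustration and error handling is omitted; the point is that two processes mapping the same object touch the same physical cache lines, kept consistent by the coherence protocol rather than by the OS.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Create (or open) a named shared-memory object one page long.
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);
    volatile int* shared = (volatile int*)mmap(
        nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // A reader process mapping "/demo_shm" sees this store through the
    // cache-coherence protocol; no explicit flush is needed.
    *shared = 42;
    std::printf("wrote %d\n", *shared);

    munmap((void*)shared, 4096);
    close(fd);
    shm_unlink("/demo_shm");
}
```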

Cache bandwidth per tick for modern CPUs

Submitted by 安稳与你 on 2019-12-18 02:20:33
Question: What is the speed of cache access for modern CPUs? How many bytes can be read or written from memory every processor clock tick by an Intel P4, Core2, Core i7, or AMD? Please answer with both theoretical numbers (width of the load/store unit and its throughput in uops/tick) and practical numbers (even memcpy speed tests, or the STREAM benchmark), if any. P.S. This question relates to the maximal rate of load/store instructions in assembly. There can be a theoretical rate of loading (all Instructions Per Tick are widest
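As a crude way to get the "practical numbers" the question asks for, one can time repeated memcpy over buffers sized to a given cache level. A sketch follows; 256 KiB is an arbitrary pick aimed at L2, and for bytes/tick you would divide the result further by your measured core frequency.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const std::size_t size = 256 * 1024, reps = 100000;
    std::vector<char> src(size, 1), dst(size);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < reps; ++r)
        std::memcpy(dst.data(), src.data(), size);
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    // Each memcpy both reads and writes 'size' bytes, hence the 2x.
    std::printf("%.2f GB/s\n", 2.0 * size * reps / sec / 1e9);
    return (int)dst[0];  // keep the copies from being optimized away
}
```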

Using time stamp counter and clock_gettime for cache miss

Submitted by [亡魂溺海] on 2019-12-17 22:28:42
Question: As a follow-up to this topic, in order to calculate the memory miss latency, I have written the following code using _mm_clflush, __rdtsc, and _mm_lfence (which is based on the code from this question/answer). As you can see in the code, I first load the array into the cache. Then I flush one element, and therefore the cache line is evicted from all cache levels. I put _mm_lfence in order to preserve the order under -O3. Next, I used the time stamp counter to calculate the latency of reading the array
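A minimal reconstruction of the flush-then-time pattern described (my own sketch, not the asker's exact code): the lfence fences keep the flush, the timed load, and the rdtsc reads from reordering around each other.

```cpp
#include <x86intrin.h>  // _mm_clflush, _mm_lfence, __rdtsc
#include <cstdio>

int main() {
    static int array[1024];
    for (int i = 0; i < 1024; ++i) array[i] = i;  // warm the cache

    _mm_clflush(&array[512]);  // evict one line from all cache levels
    _mm_lfence();

    unsigned long long t0 = __rdtsc();
    _mm_lfence();
    volatile int v = array[512];  // this load must go out to DRAM
    _mm_lfence();
    unsigned long long t1 = __rdtsc();

    std::printf("miss latency ~ %llu reference cycles (v=%d)\n", t1 - t0, v);
}
```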