cpu-cache

How does communication between CPUs happen?

六月ゝ 毕业季﹏ submitted on 2021-02-19 05:40:08
Question: Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC). Are there other methods/pathways for this communication to happen? The reason it seems there are other pathways is that Intel nearly halved the amount of L3 cache per core in their newest processor lineup (1.375 MiB per core in SKL-X) vs. previous generations (2.5 MiB per core in Broadwell-EP). Per-core private L2 increased from 256 KiB to 1 MiB, though.

Answer 1: There are inter…

Why does instruction cache alignment improve performance in set associative cache implementations?

回眸只為那壹抹淺笑 submitted on 2021-02-19 03:16:55
Question: I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything. I understand the concept of cache hits and their importance in computing speed. But it seems that in set-associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block, the CPU should still get a cache hit since…

x86 MESI invalidate cache line latency issue

生来就可爱ヽ(ⅴ<●) submitted on 2021-02-17 02:00:32
Question: I have the following processes. I try to make ProcessB very low latency, so I use a tight loop all the time and isolate CPU core 2.

Global variable in shared memory:

    int bDOIT;

    typedef struct XYZ_ {
        int field1;
        int field2;
        /* ..... */
        int field20;
    } XYZ;
    XYZ glbXYZ;

    static void escape(void *p) {
        asm volatile("" : : "g"(p) : "memory");
    }

ProcessA (on core 1):

    while (1) {
        nonblocking_recv(fd, &iret);
        if (errno == EAGAIN)
            continue;
        if (iret == 1)
            bDOIT = 1;
        else
            bDOIT = 0;
    } // while

ProcessB (on…

Calculating average time for a memory access

戏子无情 submitted on 2021-02-16 19:19:38
Question: I find it hard to understand the difference between the local and global miss rate and how to calculate the average time for a memory access, and would just like to give an example of a problem that I have tried to solve. I would appreciate it if someone could tell me whether I'm on the right track, or, if I'm wrong, what I have missed. Consider the following multilevel cache hierarchy with its seek times and miss rates:

    L1 cache:    0.5 ns, 20%
    L2 cache:    1.8 ns, 5%
    L3 cache:    4.2 ns, 1.5%
    Main memory: …

Cache misses when accessing an array in nested loop

落爺英雄遲暮 submitted on 2021-02-15 06:50:33
Question: So I have this question from my professor, and I cannot figure out why vector2 is faster and has fewer cache misses than vector1. Assume that the code below is valid, compilable C code.

Vector2:

    void incrementVector2(INT4 *v, int n) {
        for (int k = 0; k < 100; ++k) {
            for (int i = 0; i < n; ++i) {
                v[i] = v[i] + 1;
            }
        }
    }

Vector1:

    void incrementVector1(INT4 *v, int n) {
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < 100; ++k) {
                v[i] = v[i] + 1;
            }
        }
    }

NOTE: INT4 means the integer is 4 bytes…

Way prediction in modern cache

泄露秘密 submitted on 2021-02-09 09:17:46
Question: We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, as no search over ways is involved for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped caches. I read that modern processors try to combine the benefits of both by using a technique called way prediction, where they predict the way of the given set where the hit is most likely to happen and search only that way. If the…

Does Cache empty itself if idle for a long time?

岁酱吖の submitted on 2021-02-09 02:50:52
Question: Does cache memory refresh itself if it doesn't encounter any instruction for a threshold amount of time? What I mean is: suppose I have a multi-core machine and I have isolated a core on it. Now, for one of the cores, there was no activity for, say, a few seconds. In this case, will the last instructions in the instruction cache be flushed after a certain amount of time has passed? I understand this can be architecture-dependent, but I am looking for general pointers on the concept.

Answer 1: If a…

I don't understand the cache miss counts from cachegrind vs. the perf tool

别等时光非礼了梦想. submitted on 2021-02-08 19:46:37
Question: I am studying cache effects using a simple micro-benchmark. I think that if N is bigger than the cache size, then the cache incurs a miss on the first read of every cache line. On my machine the cache line size is 64 bytes, so I think the cache should incur N/8 misses in total, and cachegrind shows that. However, the perf tool displays a different result: it reports only 34,265 cache-miss events. I suspected hardware prefetching, so I turned that feature off in the BIOS; the result is the same. I really don't know…