cpu-architecture

How does communication between CPUs happen?

六月ゝ 毕业季﹏, submitted on 2021-02-19 05:40:08

Question: Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC). Are there other methods/pathways for this communication to happen? The reason it seems there are other pathways is that Intel nearly halved the amount of L3 cache per core in its newest processor lineup (1.375 MiB per core in SKL-X) versus previous generations (2.5 MiB per core in Broadwell EP). Per-core private L2 increased from 256 KiB to 1 MiB, though. Answer 1: There are inter…
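For context, a minimal sketch of the baseline pathway the question takes for granted: cores communicate through ordinary cache-coherent shared memory. The store below invalidates the other core's copy of the line, and the load pulls the fresh data from wherever the coherence protocol finds it (the other core's private L2, a shared L3 slice, or DRAM). All names here are illustrative, not from the original question.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    int main() {
        std::thread producer([] {
            data.store(42, std::memory_order_relaxed);
            ready.store(true, std::memory_order_release); // publish: makes data visible
        });
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) {} // spin until the line arrives
            std::printf("data = %d\n", data.load(std::memory_order_relaxed));
        });
        producer.join();
        consumer.join();
    }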

Why does instruction cache alignment improve performance in set-associative cache implementations?

回眸只為那壹抹淺笑, submitted on 2021-02-19 03:16:55

Question: I have a question regarding instruction-cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything. I understand the concept of cache hits and their importance in computing speed. But it seems that in set-associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block, the CPU should still get a cache hit, since…
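For what the micro-optimization itself looks like, here is a minimal sketch (GCC/Clang-specific; the function name and sizes are made up): forcing a small hot loop to start on a 64-byte boundary so its body spans one cache line and one aligned fetch block instead of two. Compilers expose the same idea through flags such as -falign-loops and -falign-functions.

    #include <cstddef>
    #include <cstdint>

    // Ask the compiler to place the first instruction of this function on a
    // 64-byte boundary, so the short loop inside it does not straddle two
    // cache lines / fetch blocks.
    __attribute__((aligned(64)))
    uint64_t sum(const uint32_t* p, std::size_t n) {
        uint64_t s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += p[i];   // tiny body: fits in one 64-byte line when aligned
        return s;
    }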

Calculating average time for a memory access

戏子无情, submitted on 2021-02-16 19:19:38

Question: I find it hard to understand the difference between the local and global miss rate and how to calculate the average time for a memory access, and would just like to give an example of a problem that I have tried to solve. I would appreciate it if someone could tell me whether I'm on the right track or, if I'm wrong, what I have missed. Consider the following multilevel cache hierarchy with the access times and miss rates:

    L1 cache:    0.5 ns, 20%
    L2 cache:    1.8 ns, 5%
    L3 cache:    4.2 ns, 1.5%
    Main memory: …
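A minimal worked sketch of the standard recurrence, assuming the listed times are per-level access times and the listed miss rates are local (both are assumptions, since the excerpt is cut off): AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * (t_L3 + m_L3 * t_mem)). The main-memory latency below is a placeholder, not a number from the original problem.

    #include <cstdio>

    int main() {
        // Access times in ns; t_mem is an assumed placeholder (the excerpt
        // cuts off before the question's real main-memory figure).
        double t_l1 = 0.5, t_l2 = 1.8, t_l3 = 4.2, t_mem = 70.0;
        // Local miss rates: fraction of accesses *reaching this level* that miss.
        double m_l1 = 0.20, m_l2 = 0.05, m_l3 = 0.015;

        // AMAT = hit time + miss rate * miss penalty, applied level by level.
        double amat = t_l1 + m_l1 * (t_l2 + m_l2 * (t_l3 + m_l3 * t_mem));
        std::printf("AMAT = %.4f ns\n", amat);  // ~0.91 ns with these numbers
    }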

Why is memory reordering not a problem on single-core/processor machines?

末鹿安然, submitted on 2021-02-16 13:52:07

Question: Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:

    x = 0; f = 0;

    Thread #1:
        while (f == 0);
        print x;

    Thread #2:
        x = 42;
        f = 1;

I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution. However, I don't understand why this is not a problem on a single-core machine, with those…
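The same example as compilable C++, a sketch only (the question's pseudocode uses plain variables; relaxed atomics stand in here so the program is at least data-race-free). On two cores the relaxed ordering permits 0 to be printed; on one core, a context switch interleaves the two instruction streams but never splits a single core's already-executed stores out of its own program order, which is the situation the question is asking about.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, f{0};

    int main() {
        std::thread t1([] {
            while (f.load(std::memory_order_relaxed) == 0) {}       // spin on the flag
            std::printf("%d\n", x.load(std::memory_order_relaxed)); // 42, or 0 under weak ordering
        });
        std::thread t2([] {
            x.store(42, std::memory_order_relaxed);
            f.store(1, std::memory_order_relaxed); // relaxed: may become visible before x
        });
        t1.join();
        t2.join();
    }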

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

半腔热情, submitted on 2021-02-16 12:57:07

Question: A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/, and his entire series of articles talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is how are these barriers and memory-ordering semantics…
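For concreteness, here is the abstract pattern those articles describe, as a C++ sketch using standalone fences (illustrative, not from the question). How the two fences are implemented differs per microarchitecture: on x86 both typically compile to nothing beyond a compiler barrier, because the hardware already keeps ordinary stores and loads sufficiently ordered, while on ARM they become dmb-style barriers; a seq_cst fence is mfence on x86 and dmb ish on ARM.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // plain data, published via the flag + fences
    std::atomic<bool> flag{false};

    void producer() {
        payload = 123;
        std::atomic_thread_fence(std::memory_order_release); // earlier stores can't sink below
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire); // later loads can't hoist above
        assert(payload == 123);  // guaranteed by the paired fences
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }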

memory_order_relaxed and visibility

这一生的挚爱, submitted on 2021-02-15 07:36:51

Question: Consider two threads, T1 and T2, that store and load an atomic integer a_i respectively. Let's further assume that the store is executed before the load starts being executed; by "before", I mean in the absolute sense of time.

    T1:
        // other instructions here...
        a_i.store(7, memory_order_relaxed)
        // other instructions here

    T2:
        // other instructions here...
        a_i.load(memory_order_relaxed)
        // other instructions here

Is it guaranteed that T2 sees the value 7 after the load? Answer 1: Is it…
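As a contrast to the question's timing-based intuition, here is the textbook case where relaxed is enough, a sketch only (names invented): relaxed guarantees atomicity and a single modification order per variable, but no cross-thread happens-before, which suits plain event counters.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> hits{0};

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back([] {
                for (int j = 0; j < 100000; ++j)
                    hits.fetch_add(1, std::memory_order_relaxed); // atomic but unordered
            });
        for (auto& t : pool) t.join();
        std::printf("%ld\n", hits.load());  // always 400000: no increment is lost
    }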

Data hazards and nop insertion

放肆的年华, submitted on 2021-02-11 14:52:34

Question: Consider the following code sequence, executed on a processor that doesn't support stalls and only supports ALU-ALU forwarding:

    I1: lw  $1, 40($6)
    I2: add $6, $2, $2
    I3: sw  $6, 50($1)
    I4: lw  $5, -16($5)
    I5: sw  $5, -16($5)
    I6: add $5, $5, $5

Now the only way to run this code on this processor is to insert nops. The solution is:

    I1:  lw  $1, 40($6)
    I2:  add $6, $2, $2
    I22: nop
    I3:  sw  $6, 50($1)
    I4:  lw  $5, -16($5)
    I44: nop
    I45: nop
    I5:  sw  $5, -16($5)
    I6:  add $5, $5, $5

My question is why…

Why are there no NAND, NOR and XNOR instructions in x86?

﹥>﹥吖頭↗, submitted on 2021-02-10 17:48:01

Question: They're among the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement). Performing NOT(AND(x, y)) doubles execution time, dependency-chain length, and code size. BMI1 introduced "andnot", which is a meaningful addition because it is a unique operation, so why not the ones in the title of this question? You usually read answers along the lines of "they take up valuable opcode space", but then I look at all of the kmask operations introduced with…
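A minimal sketch of the asymmetry the question complains about (function names are made up). A NAND has to be composed from two dependent instructions, while the BMI1 operation the question mentions is exposed directly, e.g. through the _andn_u32 intrinsic, which computes ~a & b in one instruction (compile with -mbmi):

    #include <cstdint>
    #include <immintrin.h>

    uint32_t nand32(uint32_t x, uint32_t y) {
        return ~(x & y);         // typically two instructions: and, then not
    }

    uint32_t andnot32(uint32_t x, uint32_t y) {
        return _andn_u32(x, y);  // single BMI1 instruction: andn (inverts x, not y)
    }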