cpu-architecture

How does communication between CPUs happen?

六月ゝ 毕业季﹏, submitted on 2021-02-19 05:40:08

Question: Another question about L2/L3 caches explained that L3 can be used for inter-process communication (IPC). Are there other methods/pathways for this communication to happen? The reason it seems there are other pathways is that Intel nearly halved the amount of L3 cache per core in its newest processor lineup (1.375 MiB per core in SKL-X) versus previous generations (2.5 MiB per core in Broadwell EP). Per-core private L2 increased from 256 KiB to 1 MiB, though. Answer 1: There are inter…
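For context, a minimal sketch of the baseline pathway the question takes for granted: cores communicate through ordinary cache-coherent shared memory. The store below invalidates the other core's copy of the line, and the load pulls the fresh data from wherever the coherence protocol finds it (the other core's private L2, a shared L3 slice, or DRAM). All names here are illustrative, not from the original question.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    int main() {
        std::thread producer([] {
            data.store(42, std::memory_order_relaxed);
            ready.store(true, std::memory_order_release); // publish: makes data visible
        });
        std::thread consumer([] {
            while (!ready.load(std::memory_order_acquire)) {} // spin until the line arrives
            std::printf("data = %d\n", data.load(std::memory_order_relaxed));
        });
        producer.join();
        consumer.join();
    }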

Why does instruction cache alignment improve performance in set-associative cache implementations?

回眸只為那壹抹淺笑, submitted on 2021-02-19 03:16:55

Question: I have a question regarding instruction-cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything. I understand the concept of cache hits and their importance in computing speed. But it seems that in set-associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block, the CPU should still get a cache hit, since…
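For what the micro-optimization itself looks like, here is a minimal sketch (GCC/Clang-specific; the function name and sizes are made up): forcing a small hot loop to start on a 64-byte boundary so its body spans one cache line and one aligned fetch block instead of two. Compilers expose the same idea through flags such as -falign-loops and -falign-functions.

    #include <cstddef>
    #include <cstdint>

    // Ask the compiler to place the first instruction of this function on a
    // 64-byte boundary, so the short loop inside it does not straddle two
    // cache lines / fetch blocks.
    __attribute__((aligned(64)))
    uint64_t sum(const uint32_t* p, std::size_t n) {
        uint64_t s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += p[i];   // tiny body: fits in one 64-byte line when aligned
        return s;
    }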

Calculating average time for a memory access

戏子无情, submitted on 2021-02-16 19:19:38

Question: I find it hard to understand the difference between the local and global miss rate and how to calculate the average time for a memory access, and would just like to give an example of a problem that I have tried to solve. I would appreciate it if someone could tell me whether I'm on the right track or, if I'm wrong, what I have missed. Consider the following multilevel cache hierarchy with the access times and miss rates:

    L1 cache:    0.5 ns, 20%
    L2 cache:    1.8 ns, 5%
    L3 cache:    4.2 ns, 1.5%
    Main memory: …
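A minimal worked sketch of the standard recurrence, assuming the listed times are per-level access times and the listed miss rates are local (both are assumptions, since the excerpt is cut off): AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * (t_L3 + m_L3 * t_mem)). The main-memory latency below is a placeholder, not a number from the original problem.

    #include <cstdio>

    int main() {
        // Access times in ns; t_mem is an assumed placeholder (the excerpt
        // cuts off before the question's real main-memory figure).
        double t_l1 = 0.5, t_l2 = 1.8, t_l3 = 4.2, t_mem = 70.0;
        // Local miss rates: fraction of accesses *reaching this level* that miss.
        double m_l1 = 0.20, m_l2 = 0.05, m_l3 = 0.015;

        // AMAT = hit time + miss rate * miss penalty, applied level by level.
        double amat = t_l1 + m_l1 * (t_l2 + m_l2 * (t_l3 + m_l3 * t_mem));
        std::printf("AMAT = %.4f ns\n", amat);  // ~0.91 ns with these numbers
    }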

Why is memory reordering not a problem on single-core/processor machines?

末鹿安然, submitted on 2021-02-16 13:52:07

Question: Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:

    x = 0; f = 0;

    Thread #1:
        while (f == 0);
        print x;

    Thread #2:
        x = 42;
        f = 1;

I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution. However, I don't understand why this is not a problem on a single-core machine, with those…
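The same example as compilable C++, a sketch only (the question's pseudocode uses plain variables; relaxed atomics stand in here so the program is at least data-race-free). On two cores the relaxed ordering permits 0 to be printed; on one core, a context switch interleaves the two instruction streams but never splits a single core's already-executed stores out of its own program order, which is the situation the question is asking about.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, f{0};

    int main() {
        std::thread t1([] {
            while (f.load(std::memory_order_relaxed) == 0) {}       // spin on the flag
            std::printf("%d\n", x.load(std::memory_order_relaxed)); // 42, or 0 under weak ordering
        });
        std::thread t2([] {
            x.store(42, std::memory_order_relaxed);
            f.store(1, std::memory_order_relaxed); // relaxed: may become visible before x
        });
        t1.join();
        t2.join();
    }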

How are barriers/fences and acquire/release semantics implemented microarchitecturally?

半腔热情, submitted on 2021-02-16 12:57:07

Question: A lot of questions on SO and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/, and his entire series of articles talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is how are these barriers and memory-ordering semantics…
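For concreteness, here is the abstract pattern those articles describe, as a C++ sketch using standalone fences (illustrative, not from the question). How the two fences are implemented differs per microarchitecture: on x86 both typically compile to nothing beyond a compiler barrier, because the hardware already keeps ordinary stores and loads sufficiently ordered, while on ARM they become dmb-style barriers; a seq_cst fence is mfence on x86 and dmb ish on ARM.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                 // plain data, published via the flag + fences
    std::atomic<bool> flag{false};

    void producer() {
        payload = 123;
        std::atomic_thread_fence(std::memory_order_release); // earlier stores can't sink below
        flag.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!flag.load(std::memory_order_relaxed)) {}
        std::atomic_thread_fence(std::memory_order_acquire); // later loads can't hoist above
        assert(payload == 123);  // guaranteed by the paired fences
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }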

memory_order_relaxed and visibility

这一生的挚爱, submitted on 2021-02-15 07:36:51

Question: Consider two threads, T1 and T2, that store and load an atomic integer a_i respectively. Let's further assume that the store is executed before the load starts being executed; by "before", I mean in the absolute sense of time.

    T1:
        // other instructions here...
        a_i.store(7, memory_order_relaxed)
        // other instructions here

    T2:
        // other instructions here...
        a_i.load(memory_order_relaxed)
        // other instructions here

Is it guaranteed that T2 sees the value 7 after the load? Answer 1: Is it…
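As a contrast to the question's timing-based intuition, here is the textbook case where relaxed is enough, a sketch only (names invented): relaxed guarantees atomicity and a single modification order per variable, but no cross-thread happens-before, which suits plain event counters.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> hits{0};

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)
            pool.emplace_back([] {
                for (int j = 0; j < 100000; ++j)
                    hits.fetch_add(1, std::memory_order_relaxed); // atomic but unordered
            });
        for (auto& t : pool) t.join();
        std::printf("%ld\n", hits.load());  // always 400000: no increment is lost
    }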

Data hazards and nop insertion

放肆的年华, submitted on 2021-02-11 14:52:34

Question: Consider the following code sequence, executed on a processor that doesn't support stalls and only supports ALU-ALU forwarding:

    I1: lw  $1, 40($6)
    I2: add $6, $2, $2
    I3: sw  $6, 50($1)
    I4: lw  $5, -16($5)
    I5: sw  $5, -16($5)
    I6: add $5, $5, $5

Now the only way to run this code on this processor is to insert nops. The solution is:

    I1:  lw  $1, 40($6)
    I2:  add $6, $2, $2
    I22: nop
    I3:  sw  $6, 50($1)
    I4:  lw  $5, -16($5)
    I44: nop
    I45: nop
    I5:  sw  $5, -16($5)
    I6:  add $5, $5, $5

My question is why…

Why are there no NAND, NOR and XNOR instructions in x86?

﹥>﹥吖頭↗, submitted on 2021-02-10 17:48:01

Question: They're among the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement). Performing NOT(AND(x, y)) doubles execution time, dependency-chain length, and code size. BMI1 introduced "andnot", which is a meaningful addition because it is a unique operation, so why not the ones in the title of this question? You usually read answers along the lines of "they take up valuable opcode space", but then I look at all of the kmask operations introduced with…
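A minimal sketch of the asymmetry the question complains about (function names are made up). A NAND has to be composed from two dependent instructions, while the BMI1 operation the question mentions is exposed directly, e.g. through the _andn_u32 intrinsic, which computes ~a & b in one instruction (compile with -mbmi):

    #include <cstdint>
    #include <immintrin.h>

    uint32_t nand32(uint32_t x, uint32_t y) {
        return ~(x & y);         // typically two instructions: and, then not
    }

    uint32_t andnot32(uint32_t x, uint32_t y) {
        return _andn_u32(x, y);  // single BMI1 instruction: andn (inverts x, not y)
    }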