cpu-architecture | 易学教程

Why are there no NAND, NOR and XNOR instructions in X86?


Question: They're among the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement). Performing NOT(AND(x, y)) doubles execution time, dependency chain length, and code size. BMI1 introduced "andnot", which is a meaningful addition and a unique operation, so why not the ones in the title of this question? You usually read answers along the lines of "they take up valuable opcode space", but then I look at all of the kmask operations introduced with ...
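To make the operations in the question concrete, here is a minimal C++ sketch. The function names (nand_op, nor_op, xnor_op, andn_op) are hypothetical and chosen only for illustration. Without a dedicated instruction, each of the first three is typically a binary operation followed by a dependent NOT, whereas BMI1 ANDN inverts only its first operand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names, for illustration only.
// Each of these is a binary operation followed by a dependent NOT.
constexpr std::uint32_t nand_op(std::uint32_t x, std::uint32_t y) { return ~(x & y); }
constexpr std::uint32_t nor_op(std::uint32_t x, std::uint32_t y)  { return ~(x | y); }
constexpr std::uint32_t xnor_op(std::uint32_t x, std::uint32_t y) { return ~(x ^ y); }

// BMI1 ANDN computes (~x) & y: only the first operand is inverted,
// so it is not the same operation as NAND.
constexpr std::uint32_t andn_op(std::uint32_t x, std::uint32_t y) { return ~x & y; }

static_assert(nand_op(0xF0F0F0F0u, 0xFF00FF00u) == 0x0FFF0FFFu, "NAND");
static_assert(andn_op(0xF0F0F0F0u, 0xFF00FF00u) == 0x0F000F00u, "ANDN");
```

Whether a compiler actually emits two instructions depends on the surrounding code, since an inversion can sometimes be folded into a neighbouring operation.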

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?


Question: TL;DR: In a producer-consumer queue, does it ever make sense to use a memory fence that is unnecessary from the C++ memory model's viewpoint, or an unnecessarily strong memory order, to get better latency at the expense of possibly worse throughput? The C++ memory model is implemented on hardware by emitting some sort of memory fence for stronger memory orders and omitting it for weaker ones. In particular, if the producer does store(memory_order_release), and the consumer observes the stored value with ...
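The release/acquire pairing the question refers to can be sketched as follows. This is a minimal illustration, not the asker's queue: the Channel type and the functions producer, consumer, and run_once are hypothetical names. It shows only the ordering guarantee (the consumer that sees the flag also sees the payload) and says nothing about how quickly the store becomes visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical single-slot channel, for illustration only.
struct Channel {
    int payload = 0;                 // plain data published by the producer
    std::atomic<bool> ready{false};  // flag that guards the payload
};

void producer(Channel& ch, int value) {
    ch.payload = value;
    // Release: the write to payload cannot be reordered after this store.
    ch.ready.store(true, std::memory_order_release);
}

int consumer(Channel& ch) {
    // Acquire: once true is observed, the payload write is visible too.
    while (!ch.ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    return ch.payload;
}

// Runs one producer and one consumer thread and returns what was received.
int run_once(int value) {
    Channel ch;
    int seen = 0;
    std::thread c([&] { seen = consumer(ch); });
    std::thread p([&] { producer(ch, value); });
    p.join();
    c.join();
    return seen;
}
```

Replacing either memory order with memory_order_seq_cst would keep the code correct; whether that changes latency is exactly what the question asks.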

gem5 cache statistics - reset and dump


Question: I am trying to get familiar with the gem5 simulator. To start, I wrote a simple program:

int main() { m5_reset_stats(0, 0); m5_dump_stats(0, 0); return 0; }

I compiled it with util/m5/m5op_x86.S and ran it using:

./build/X86/gem5.opt configs/example/se.py --caches -c ~/tmp/hello

The m5out/stats.txt shows (among other things):

system.cpu.dcache.ReadReq_hits::total 881
system.cpu.dcache.WriteReq_hits::total 917
system.cpu.dcache.ReadReq_misses::total 54
system.cpu.dcache.WriteReq_misses:

Way prediction in modern cache


Question: We know that direct-mapped caches are better than set-associative caches in terms of hit time, since there is no search across ways for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped caches. I read that modern processors try to combine the benefits of both using a technique called way prediction: they predict the line (way) of the given set where the hit is most likely to happen and search only that line. If the ...
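The idea described above can be sketched as a toy model. Everything here is an assumption made for illustration: the WayPredictedCache type, the sizes kSets and kWays, the most-recently-used predictor, and the round-robin fill policy are not taken from any particular processor. The model probes the predicted way first and falls back to searching all ways of the set on a misprediction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSets = 4;  // assumed toy sizes
constexpr int kWays = 4;

struct Line {
    bool valid = false;
    std::uint64_t tag = 0;
};

struct WayPredictedCache {
    enum class Result { FastHit, SlowHit, Miss };

    std::array<std::array<Line, kWays>, kSets> lines{};
    std::array<int, kSets> predicted{};  // predicted way per set (MRU here)

    Result access(std::uint64_t block_addr) {
        const int set = static_cast<int>(block_addr % kSets);
        const std::uint64_t tag = block_addr / kSets;

        // Step 1: probe only the predicted way (direct-mapped-like latency).
        const int guess = predicted[set];
        if (lines[set][guess].valid && lines[set][guess].tag == tag)
            return Result::FastHit;

        // Step 2: misprediction, so search every way of the set.
        for (int w = 0; w < kWays; ++w) {
            if (lines[set][w].valid && lines[set][w].tag == tag) {
                predicted[set] = w;  // retrain the predictor
                return Result::SlowHit;
            }
        }

        // Step 3: miss. Fill the way after the predicted one (assumed policy).
        const int victim = (guess + 1) % kWays;
        lines[set][victim].valid = true;
        lines[set][victim].tag = tag;
        predicted[set] = victim;
        return Result::Miss;
    }
};
```

In this model a correct prediction costs one tag comparison, while a misprediction costs an extra pass over the set, which is the trade-off the question is about.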

How can memory destination BTS be significantly slower than load / BTS reg,reg / store?


Question: In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than mov + mov -> instruction -> mov + mov? Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238), I see the following numbers for the btr/bts instructions:

instruction  operands  uops fused domain  uops unfused domain  latency  throughput
mov          r,r       1                  1                    0-1      .25
mov          m,r       1                  2                    2        1
mov          r,m       1                  1                    2        .5
...
bts/btr      r,r       1                  1
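To make the two sequences being compared concrete, here is a hedged C++ model of their semantics for 64-bit operands. The function names are hypothetical, and the model is restricted to non-negative bit indices. It reflects my reading of the ISA documentation: with a register destination the bit index is taken modulo the operand width, while with a memory destination and a register index the instruction addresses a bit string, so the index can select a different word than the one at the base address.

```cpp
#include <cassert>
#include <cstdint>

// Register form (bts r64, r64): index is taken modulo 64.
// Returns the old bit, which the instruction places in the carry flag.
inline bool bts_reg(std::uint64_t& reg, std::uint64_t index) {
    const std::uint64_t mask = std::uint64_t{1} << (index % 64);
    const bool old_bit = (reg & mask) != 0;
    reg |= mask;
    return old_bit;
}

// The sequence from the question: load / BTS reg,reg / store.
// Always touches the single word at `word`. Not atomic.
inline bool bts_load_op_store(std::uint64_t* word, std::uint64_t index) {
    std::uint64_t tmp = *word;              // load
    const bool old_bit = bts_reg(tmp, index);  // BTS reg,reg
    *word = tmp;                            // store
    return old_bit;
}

// Memory-destination form with a register index (non-negative indices only
// in this model): the index selects a word relative to `base` first.
inline bool bts_mem(std::uint64_t* base, std::uint64_t index) {
    return bts_load_op_store(base + index / 64, index);
}
```

The two forms agree whenever the index is below 64 and differ otherwise, so they are not interchangeable in general.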