cpu-architecture

Why are there no NAND, NOR and XNOR instructions in X86?

假如想象 submitted on 2021-02-10 17:46:36
Question: They're one of the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement). Performing NOT(AND(x, y)) doubles execution time, dependency chain length, and code size. BMI1 introduced "andnot", a meaningful addition that is a unique operation, so why not the ones in the title of this question? You usually read answers along the lines of "they take up valuable op-code space", but then I look at all of the kmask operations introduced with
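
To make the operations in the question concrete, here is a minimal C++ sketch of NAND, NOR and XNOR written as "base operation plus NOT", next to the BMI1 ANDN semantics the question mentions. The function names (nand32, nor32, xnor32, andn32) are illustrative and not from the original post; a compiler is free to lower them however it likes.

```cpp
#include <cstdint>

// Each negated form is the base operation followed by one extra NOT,
// which is the two-step sequence the question is complaining about.
constexpr std::uint32_t nand32(std::uint32_t x, std::uint32_t y) { return ~(x & y); }
constexpr std::uint32_t nor32(std::uint32_t x, std::uint32_t y)  { return ~(x | y); }
constexpr std::uint32_t xnor32(std::uint32_t x, std::uint32_t y) { return ~(x ^ y); }

// BMI1 ANDN computes (~x) & y: the first input is inverted, not the result,
// so it is a different operation from NAND.
constexpr std::uint32_t andn32(std::uint32_t x, std::uint32_t y) { return ~x & y; }

// Compile-time sanity checks on one pair of inputs.
static_assert(nand32(0xF0F0F0F0u, 0xFF00FF00u) == 0x0FFF0FFFu, "nand");
static_assert(andn32(0xF0F0F0F0u, 0xFF00FF00u) == 0x0F000F00u, "andn");
```

Note that nand32 and andn32 give different results for the same inputs, which is why ANDN does not already cover the NAND case.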

Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?

独自空忆成欢 submitted on 2021-02-10 07:12:33
Question: TL;DR: In a producer-consumer queue, does it ever make sense to add a memory fence that is unnecessary from the C++ memory model's viewpoint, or to use an unnecessarily strong memory order, to get better latency at the expense of possibly worse throughput? The C++ memory model is implemented on hardware by emitting some sort of memory fence for stronger memory orders and none for weaker ones. In particular, if the producer does store(memory_order_release), and the consumer observes the stored value with
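
The setup being asked about can be sketched as a single-slot handoff. This is a minimal sketch, not the asker's queue: Slot, produce, consume and run_handoff are hypothetical names, and the optional seq_cst fence only marks where an "unnecessary" barrier would go. The code makes no claim about whether that fence changes latency; it only shows that correctness does not depend on it.

```cpp
#include <atomic>
#include <thread>

// Hypothetical one-element mailbox standing in for the queue in the question.
struct Slot {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Producer: write the payload, then publish it with a release store.
inline void produce(Slot& s, int value, bool extra_fence) {
    s.payload = value;
    s.ready.store(true, std::memory_order_release);
    if (extra_fence) {
        // Not needed for correctness under the C++ memory model; this is
        // the extra barrier whose latency effect the question asks about.
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
}

// Consumer: spin until the acquire load sees the flag, then read the payload.
inline int consume(Slot& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return s.payload;  // ordered after the producer's write by release/acquire
}

// Runs one producer and one consumer and returns what the consumer saw.
inline int run_handoff(bool extra_fence) {
    Slot s;
    int seen = -1;
    std::thread consumer([&] { seen = consume(s); });
    std::thread producer([&] { produce(s, 42, extra_fence); });
    producer.join();
    consumer.join();
    return seen;
}
```

Calling run_handoff(false) and run_handoff(true) returns the same value either way; any difference would show up only in timing, which has to be measured on the target hardware.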

gem5 cache statistics - reset and dump

杀马特。学长 韩版系。学妹 submitted on 2021-02-09 11:54:32
Question: I am trying to get familiar with the gem5 simulator. To start, I wrote a simple program:
    int main() { m5_reset_stats(0, 0); m5_dump_stats(0, 0); return 0; }
I compiled it with util/m5/m5op_x86.S and ran it using:
    ./build/X86/gem5.opt configs/example/se.py --caches -c ~/tmp/hello
The m5out/stats.txt shows (among other things):
    system.cpu.dcache.ReadReq_hits::total 881
    system.cpu.dcache.WriteReq_hits::total 917
    system.cpu.dcache.ReadReq_misses::total 54
    system.cpu.dcache.WriteReq_misses:

Way prediction in modern cache

泄露秘密 submitted on 2021-02-09 09:17:46
Question: We know that direct-mapped caches are better than set-associative caches in terms of cache hit time, as there is no search involved for a particular tag. On the other hand, set-associative caches usually show a better hit rate than direct-mapped caches. I read that modern processors try to combine the benefits of both by using a technique called way prediction: they predict the line (way) of the given set where the hit is most likely to happen and search only in that line. If the
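
The idea described above can be modelled in a few lines. This is a toy sketch under stated assumptions: a single 4-way set, a predictor that simply remembers the most recently used way, and hypothetical names (CacheSet, Lookup, lookup, fill). Real predictors may use other policies; nothing here is taken from a specific processor.

```cpp
#include <array>
#include <cstdint>

constexpr int kWays = 4;  // assumed associativity for this sketch

// One set of a set-associative cache plus its way-prediction state.
struct CacheSet {
    std::array<std::uint64_t, kWays> tag{};
    std::array<bool, kWays> valid{};
    int predicted_way = 0;  // assumed policy: most recently used way
};

enum class Lookup { FastHit, SlowHit, Miss };

// Probe the predicted way first (direct-mapped-like hit time); only on a
// misprediction search the remaining ways and retrain the predictor.
inline Lookup lookup(CacheSet& set, std::uint64_t tag) {
    const int p = set.predicted_way;
    if (set.valid[p] && set.tag[p] == tag) {
        return Lookup::FastHit;
    }
    for (int w = 0; w < kWays; ++w) {
        if (w != p && set.valid[w] && set.tag[w] == tag) {
            set.predicted_way = w;
            return Lookup::SlowHit;
        }
    }
    return Lookup::Miss;
}

// Install a line in the given way and predict it for the next access.
inline void fill(CacheSet& set, int way, std::uint64_t tag) {
    set.tag[way] = tag;
    set.valid[way] = true;
    set.predicted_way = way;
}
```

A FastHit corresponds to the case where only the predicted line is searched; a SlowHit is a hit that still pays for the wider search after a wrong prediction.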

How can memory destination BTS be significantly slower than load / BTS reg,reg / store?

拜拜、爱过 submitted on 2021-02-09 04:37:06
Question: In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than mov + mov -> instruction -> mov + mov? Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238), I see the following numbers for the btr/bts instructions:
instruction  operands  uops fused domain  uops unfused domain  latency  throughput
mov          r,r       1                  1                    0-1      .25
mov          m,r       1                  2                    2        1
mov          r,m       1                  1                    2        .5
...
bts/btr      r,r       1                  1
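
For reference, the two code shapes being compared can be modelled in C++. This is a sketch of the architectural semantics only, not of timing, and the helper names (bts_reg, bts_load_modify_store, bts_mem) are hypothetical. It relies on the documented x86 behaviour that the register-destination form takes the bit index modulo the operand size, while the memory-destination form with a register index treats memory as a bit string and can address a dword other than the one at the base address. Whether that difference explains the numbers in the table is left to the original thread.

```cpp
#include <cstdint>

// Model of bts r32, r32: the bit index is taken modulo 32.
// Returns the updated value; old_bit receives what would go into CF.
inline std::uint32_t bts_reg(std::uint32_t value, std::uint32_t bit, bool& old_bit) {
    const std::uint32_t mask = std::uint32_t{1} << (bit & 31);
    old_bit = (value & mask) != 0;
    return value | mask;
}

// The explicit load / bts reg,reg / store sequence from the title.
// It only ever touches the single dword at *p.
inline bool bts_load_modify_store(std::uint32_t* p, std::uint32_t bit) {
    bool old_bit = false;
    std::uint32_t v = *p;          // load
    v = bts_reg(v, bit, old_bit);  // bts reg,reg
    *p = v;                        // store
    return old_bit;
}

// Model of bts m32, r32: the signed bit index selects a dword relative to
// base, then a bit inside it. The right shift is an arithmetic shift on
// mainstream compilers (guaranteed since C++20).
inline bool bts_mem(std::uint32_t* base, std::int32_t bit) {
    std::uint32_t* word = base + (bit >> 5);
    bool old_bit = false;
    *word = bts_reg(*word, static_cast<std::uint32_t>(bit), old_bit);
    return old_bit;
}
```

With a bit index of 35, bts_mem sets bit 3 of the dword after the base, while bts_load_modify_store sets bit 3 of the base dword itself, so the two are interchangeable only when the index is known to be in the range 0 to 31.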
