cpu-architecture

While pipelining, can you consecutively write mov to the same register, or does it require 3 NOPs like add does?

Submitted by 三世轮回 on 2019-12-08 04:19:58
Question: This is the correct way to implement mov and add in x86 when incorporating pipelining, with the necessary NOPs:

mov $10, eax
NOP
NOP
NOP
add $2, eax

If I wanted to change eax with mov, could I immediately overwrite it with another mov, since you're just overwriting what is already there, or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

mov $10, eax
mov $12, eax

or

mov $10, eax
NOP
NOP
NOP
mov $12, eax

Answer 1: This is the correct way to implement mov and add through
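To see why the back-to-back movs are safe while the add is not: nothing ever reads the value the first mov wrote, so there is no read-after-write (RAW) dependency to stall on, only a write-after-write (WAW) ordering, which an in-order pipeline satisfies for free because writes reach the register file in program order. A minimal C sketch of the two dependence shapes (the variable name is illustrative, not from the question):

#include <stdio.h>

int main(void) {
    int eax;        /* stands in for the register */

    eax = 10;       /* first mov: its value is never read...         */
    eax = 12;       /* ...before this overwrite, so WAW only, no RAW */

    /* add, by contrast, reads its destination: a RAW dependency on
       the preceding mov, which is what the NOPs (or forwarding)
       must cover in a simple 5-stage pipeline. */
    eax = 10;
    eax = eax + 2;

    printf("%d\n", eax);
    return 0;
}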

Preserving the Execution pipeline

Submitted by 半城伤御伤魂 on 2019-12-08 04:13:20
Question: Return values are frequently checked for errors, but the code that will continue to execute may be laid out in different ways:

if (!ret) { doNoErrorCode(); }
exit(1);

or

if (ret) { exit(1); }
doNoErrorCode();

Heavyweight CPUs can speculate about branches in near proximity/locality using simple statistics. I studied a 4-bit mechanism for branch speculation (-2, -1, 0, +1, +2), where zero means unknown and 2 is treated as a taken branch. Considering the simple technique above, my
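As a hedged aside on steering this layout choice from source code: GCC and Clang provide __builtin_expect, which tells the compiler which way a branch usually goes so it can place the common path on the fall-through that static predictors favor. A sketch reusing the question's doNoErrorCode (the stub body and the rare-error assumption are mine, for illustration):

#include <stdio.h>
#include <stdlib.h>

static void doNoErrorCode(void) { puts("ok"); }

static void handle(int ret) {
    if (__builtin_expect(ret != 0, 0)) {   /* error path assumed rare   */
        exit(1);
    }
    doNoErrorCode();                       /* common fall-through path  */
}

int main(void) {
    handle(0);
    return 0;
}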

Write Allocate / Fetch on Write Cache Policy

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-08 03:22:24
Question: I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's paper for those interested. This is how I understood it:

1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write onto the allocated and updated-by-fetch
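To make the listed steps concrete, here is a minimal, hypothetical one-line cache in C that walks through exactly that miss sequence (the structure and names are invented for illustration; write-back of dirty lines to lower memory is omitted):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 64

struct line { uint64_t tag; int valid; uint8_t data[LINE_SIZE]; };

static uint8_t memory[1 << 16];                /* stand-in for lower memory */

static void write_byte(struct line *c, uint64_t addr, uint8_t val) {
    uint64_t tag = addr / LINE_SIZE;
    if (!c->valid || c->tag != tag) {          /* step 2: write miss */
        c->tag = tag;                          /* step 3: allocate the block (Write-Allocate) */
        c->valid = 1;
        memcpy(c->data, &memory[tag * LINE_SIZE], LINE_SIZE);  /* step 4: Fetch-on-Write */
    }
    c->data[addr % LINE_SIZE] = val;           /* step 5: write into the fetched block */
}

int main(void) {
    struct line l = {0};
    write_byte(&l, 100, 42);                   /* step 1: write request arrives */
    printf("%u\n", l.data[100 % LINE_SIZE]);
    return 0;
}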

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

Submitted by 喜欢而已 on 2019-12-08 00:25:42
Question: Consider the following loop:

.loop:
    add rsi, STRIDE
    mov eax, dword [rsi]
    dec ebp
    jg .loop

where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not being initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of
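A hedged C analogue of that loop, for reproducing the setup without hand-written assembly (STRIDE and the iteration count are example values; buf lives in the BSS and is only touched inside the loop, matching the question):

#include <stddef.h>

#define STRIDE 4096                       /* example: one 4K page per load */
#define ITERS  (1 << 15)

static char buf[(size_t)ITERS * STRIDE];  /* zero-filled, mapped on demand */

int main(void) {
    volatile int sink;
    for (size_t i = 0; i < (size_t)ITERS * STRIDE; i += STRIDE)
        sink = *(volatile int *)&buf[i];  /* like: mov eax, dword [rsi]    */
    (void)sink;
    return 0;
}

One would then count the page-walk events with something like perf stat -e dtlb_load_misses.walk_completed ./a.out, keeping in mind that the exact event names vary across microarchitectures.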

Double-precision operations: 32-bit vs 64-bit machines

Submitted by 坚强是说给别人听的谎言 on 2019-12-07 23:31:34
Question: Why don't we see twice the performance when executing 64-bit operations (e.g., double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine? On a 32-bit machine, don't we need to fetch twice as much from memory? More importantly, don't we need twice as many cycles to execute a 64-bit operation?

Answer 1: “64-bit machine” is an ambiguous term but usually means that the processor's General-Purpose Registers are 64 bits wide. Compare the 8086 and 8088, which have the same
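One way to make the register-width point concrete: a 64-bit integer add needs an add/adc pair on a machine with 32-bit GPRs, while a double-precision add runs on the FPU/SIMD unit, whose 64-bit datapath does not depend on GPR width. A sketch (the codegen described in the comments is typical, not guaranteed):

#include <stdint.h>
#include <stdio.h>

/* On a 32-bit x86 target this typically compiles to add + adc:
   two dependent 32-bit operations. On x86-64 it is a single add. */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; }

/* Double-precision addition uses the x87/SSE2 unit on both targets,
   so it does not get twice as fast just because GPRs widen to 64 bits. */
double addd(double a, double b) { return a + b; }

int main(void) {
    printf("%llu %f\n", (unsigned long long)add64(1, 2), addd(1.5, 2.5));
    return 0;
}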

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

Submitted by 烂漫一生 on 2019-12-07 22:58:32
Question: Are Intel x86_64 processors not only a pipelined architecture, but also superscalar?

Pipelining - these two sequences execute in parallel (different stages of the same pipeline unit in the same clock, for example an ADD with 4 stages):

stage1 -> stage2 -> stage3 -> stage4 -> nothing
nothing -> stage1 -> stage2 -> stage3 -> stage4

Superscalar - these two sequences execute in parallel (two instructions can be launched to different pipeline units in the same clock, for example ADD and MUL):

ADD
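A hedged microbenchmark sketch of the distinction: a dependent chain can only benefit from pipelining (at best one result per add latency), while independent chains let a superscalar core issue several adds to different units in the same cycle. An optimizing compiler may fold these loops entirely, so treat this as a sketch of the dependence structure rather than a ready-made measurement:

#include <stdio.h>

#define N 100000000L

int main(void) {
    volatile long seed = 1;
    long x = seed, a = seed, b = seed, c = seed, d = seed;

    /* Dependent chain: each add needs the previous result. */
    for (long i = 0; i < N; i++)
        x = x + 1;

    /* Four independent chains: candidates for same-cycle issue
       on a superscalar core. */
    for (long i = 0; i < N; i++) {
        a += 1; b += 2; c += 3; d += 4;
    }

    printf("%ld %ld\n", x, a + b + c + d);
    return 0;
}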

Why does high-memory not exist for 64-bit cpu?

Submitted by 妖精的绣舞 on 2019-12-07 18:08:15
Question: I am trying to understand the high-memory problem for 32-bit CPUs on Linux. Why is there no high-memory problem for 64-bit CPUs? In particular, how does the division of virtual memory into kernel space and user space change, so that the requirement for high memory doesn't exist on 64-bit CPUs? Thanks.

Answer 1: A 32-bit system can only address 4GB of memory. In Linux this is divided into 3GB of user space and 1GB of kernel space. This 1GB is sometimes not enough, so the kernel might need to map
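The arithmetic behind that split, as a small worked example (the ~896 MiB low-memory figure is the traditional 32-bit x86 default, an assumption about the classic layout rather than a universal constant):

#include <stdio.h>

int main(void) {
    unsigned long long vspace = 1ULL << 32;     /* 4 GiB of 32-bit virtual space */
    unsigned long long user   = 3ULL << 30;     /* 3 GiB user split              */
    unsigned long long kernel = vspace - user;  /* 1 GiB kernel split            */

    /* Classic 32-bit x86 Linux reserves part of that gigabyte for
       vmalloc/fixmap, leaving roughly 896 MiB it can keep permanently
       direct-mapped; RAM beyond that is "high memory". On x86-64, the
       kernel half of a 48-bit address space dwarfs installed RAM, so
       all of it can be direct-mapped and highmem disappears. */
    printf("kernel window: %llu MiB\n", kernel >> 20);
    printf("half of a 48-bit space: %llu TiB\n", (1ULL << 47) >> 40);
    return 0;
}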

What is the maximum possible IPC that can be achieved by the Intel Nehalem microarchitecture?

Submitted by 廉价感情. on 2019-12-07 18:04:51
Question: Is there an estimate of the maximum instructions per cycle achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum instructions per cycle?

Answer 1: TL:DR: Intel Core, Nehalem, and Sandybridge/IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions (up to 2 of these can be micro-fused stores or load+ALU). Haswell up to 9th-gen: a maximum of 6
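To make the 5-instructions-in-4-uops pattern concrete, here is a hedged GCC/Clang inline-assembly sketch for x86-64 (the loop is invented for illustration): three single-uop adds plus a cmp+jne pair that macro-fuses into one uop, so a 4-wide front end can sustain 5 instructions per cycle:

static void five_per_cycle(long n) {
    long a = 0, b = 0, i = 0;
    __asm__ volatile(
        "1:\n\t"
        "add $1, %[a]\n\t"    /* single-uop ALU instruction      */
        "add $2, %[b]\n\t"    /* single-uop ALU instruction      */
        "add $1, %[i]\n\t"    /* loop counter, also a single uop */
        "cmp %[n], %[i]\n\t"  /* cmp+jne macro-fuse into one uop */
        "jne 1b\n\t"
        : [a] "+r"(a), [b] "+r"(b), [i] "+r"(i)
        : [n] "r"(n)
        : "cc");
}

int main(void) {
    five_per_cycle(100000000L);
    return 0;
}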

About the RIDL vulnerabilities and the “replaying” of loads

Submitted by 拜拜、爱过 on 2019-12-07 17:46:58
Question: I'm trying to understand the RIDL class of vulnerabilities: a class of vulnerabilities that is able to read stale data from various micro-architectural buffers. Today the known vulnerabilities exploit the LFBs, the load ports, the eMC and the store buffer. The paper linked is mainly focused on LFBs. I don't understand why the CPU would satisfy a load with the stale data in an LFB. I can imagine that if a load misses in L1d it is internally "replayed" until the L1d brings data into an

Why do L1 and L2 Cache waste space saving the same data?

Submitted by 蓝咒 on 2019-12-07 16:33:02
Question: I don't know why the L1 cache and L2 cache save the same data. For example, let's say we want to access Memory[x] for the first time. Memory[x] is first brought into the L2 cache, then the same piece of data is brought into the L1 cache, from which the CPU registers can retrieve it. But now we have duplicate data stored in both the L1 and L2 cache. Isn't that a problem, or at least a waste of storage space?

Answer 1: I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache
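A small worked example of what the duplication costs under an inclusive hierarchy versus an exclusive one (the sizes are illustrative, not tied to any particular CPU):

#include <stdio.h>

int main(void) {
    int l1 = 32, l2 = 256;   /* KiB, typical per-core sizes (illustrative) */

    /* Inclusive hierarchy: every L1 line also has a copy in L2, so the
       distinct data held near the core is bounded by L2 alone.
       Exclusive hierarchy: a line lives in exactly one level, so the
       capacities add. */
    printf("inclusive: %d KiB of distinct data\n", l2);
    printf("exclusive: %d KiB of distinct data\n", l1 + l2);
    return 0;
}

The inclusive design gives up that extra 32 KiB partly so the outer cache can answer coherence snoops on behalf of the inner one, which is one reason designers accept the duplication.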