cpu-architecture

While pipelining, can you consecutively write mov to the same register, or does it require 3 NOPs like add does?

Submitted by 三世轮回 on 2019-12-08 04:19:58
Question: This is the correct way to implement mov and add in x86 when incorporating pipelining, with the necessary NOPs:

mov $10, eax
NOP
NOP
NOP
add $2, eax

If I wanted to change eax with mov, could I immediately overwrite it with another mov, since you're just overwriting what is already there, or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

mov $10, eax
mov $12, eax

or

mov $10, eax
NOP
NOP
NOP
mov $12, eax

Answer 1: This is the correct way to implement mov and add through
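To see why the back-to-back movs are safe while the add is not: nothing ever reads the value the first mov wrote, so there is no read-after-write (RAW) dependency to stall on, only a write-after-write (WAW) ordering, which an in-order pipeline satisfies for free because writes reach the register file in program order. A minimal C sketch of the two dependence shapes (the variable name is illustrative, not from the question):

#include <stdio.h>

int main(void) {
    int eax;        /* stands in for the register */

    eax = 10;       /* first mov: its value is never read...         */
    eax = 12;       /* ...before this overwrite, so WAW only, no RAW */

    /* add, by contrast, reads its destination: a RAW dependency on
       the preceding mov, which is what the NOPs (or forwarding)
       must cover in a simple 5-stage pipeline. */
    eax = 10;
    eax = eax + 2;

    printf("%d\n", eax);
    return 0;
}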

Preserving the Execution pipeline

Submitted by 半城伤御伤魂 on 2019-12-08 04:13:20
Question: Return values are frequently checked for errors, but the code that will continue to execute may be laid out in different ways:

if (!ret) { doNoErrorCode(); }
exit(1);

or

if (ret) { exit(1); }
doNoErrorCode();

Heavyweight CPUs can speculate about branches in near proximity/locality using simple statistics. I studied a 4-bit mechanism for branch speculation (-2, -1, 0, +1, +2), where zero means unknown and 2 is treated as a taken branch. Considering the simple technique above, my
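As a hedged aside on steering this layout choice from source code: GCC and Clang provide __builtin_expect, which tells the compiler which way a branch usually goes so it can place the common path on the fall-through that static predictors favor. A sketch reusing the question's doNoErrorCode (the stub body and the rare-error assumption are mine, for illustration):

#include <stdio.h>
#include <stdlib.h>

static void doNoErrorCode(void) { puts("ok"); }

static void handle(int ret) {
    if (__builtin_expect(ret != 0, 0)) {   /* error path assumed rare   */
        exit(1);
    }
    doNoErrorCode();                       /* common fall-through path  */
}

int main(void) {
    handle(0);
    return 0;
}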

Write Allocate / Fetch on Write Cache Policy

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-08 03:22:24
Question: I couldn't find a source that explains how the policy works in great detail. The combinations of write policies are explained in Jouppi's paper for those interested. This is how I understood it:

1. A write request is sent from the CPU to the cache.
2. The request results in a cache miss.
3. A cache block is allocated for this request in the cache. (Write-Allocate)
4. The requested block is fetched from lower memory into the allocated cache block. (Fetch-on-Write)
5. Now we are able to write onto the allocated and updated-by-fetch
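To make the listed steps concrete, here is a minimal, hypothetical one-line cache in C that walks through exactly that miss sequence (the structure and names are invented for illustration; write-back of dirty lines to lower memory is omitted):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 64

struct line { uint64_t tag; int valid; uint8_t data[LINE_SIZE]; };

static uint8_t memory[1 << 16];                /* stand-in for lower memory */

static void write_byte(struct line *c, uint64_t addr, uint8_t val) {
    uint64_t tag = addr / LINE_SIZE;
    if (!c->valid || c->tag != tag) {          /* step 2: write miss */
        c->tag = tag;                          /* step 3: allocate the block (Write-Allocate) */
        c->valid = 1;
        memcpy(c->data, &memory[tag * LINE_SIZE], LINE_SIZE);  /* step 4: Fetch-on-Write */
    }
    c->data[addr % LINE_SIZE] = val;           /* step 5: write into the fetched block */
}

int main(void) {
    struct line l = {0};
    write_byte(&l, 100, 42);                   /* step 1: write request arrives */
    printf("%u\n", l.data[100 % LINE_SIZE]);
    return 0;
}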

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

Submitted by 喜欢而已 on 2019-12-08 00:25:42
Question: Consider the following loop:

.loop:
    add rsi, STRIDE
    mov eax, dword [rsi]
    dec ebp
    jg .loop

where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not being initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of
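A hedged C analogue of that loop, for reproducing the setup without hand-written assembly (STRIDE and the iteration count are example values; buf lives in the BSS and is only touched inside the loop, matching the question):

#include <stddef.h>

#define STRIDE 4096                       /* example: one 4K page per load */
#define ITERS  (1 << 15)

static char buf[(size_t)ITERS * STRIDE];  /* zero-filled, mapped on demand */

int main(void) {
    volatile int sink;
    for (size_t i = 0; i < (size_t)ITERS * STRIDE; i += STRIDE)
        sink = *(volatile int *)&buf[i];  /* like: mov eax, dword [rsi]    */
    (void)sink;
    return 0;
}

One would then count the page-walk events with something like perf stat -e dtlb_load_misses.walk_completed ./a.out, keeping in mind that the exact event names vary across microarchitectures.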

Double-precision operations: 32-bit vs 64-bit machines

Submitted by 坚强是说给别人听的谎言 on 2019-12-07 23:31:34
Question: Why don't we see twice the performance when executing 64-bit operations (e.g., double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine? On a 32-bit machine, don't we need to fetch twice as much from memory? More importantly, don't we need twice as many cycles to execute a 64-bit operation?

Answer 1: “64-bit machine” is an ambiguous term but usually means that the processor's General-Purpose Registers are 64 bits wide. Compare the 8086 and 8088, which have the same
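One way to make the register-width point concrete: a 64-bit integer add needs an add/adc pair on a machine with 32-bit GPRs, while a double-precision add runs on the FPU/SIMD unit, whose 64-bit datapath does not depend on GPR width. A sketch (the codegen described in the comments is typical, not guaranteed):

#include <stdint.h>
#include <stdio.h>

/* On a 32-bit x86 target this typically compiles to add + adc:
   two dependent 32-bit operations. On x86-64 it is a single add. */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; }

/* Double-precision addition uses the x87/SSE2 unit on both targets,
   so it does not get twice as fast just because GPRs widen to 64 bits. */
double addd(double a, double b) { return a + b; }

int main(void) {
    printf("%llu %f\n", (unsigned long long)add64(1, 2), addd(1.5, 2.5));
    return 0;
}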

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

Submitted by 烂漫一生 on 2019-12-07 22:58:32
Question: Are Intel x86_64 processors not only a pipelined architecture, but also superscalar?

Pipelining - these two sequences execute in parallel (different stages of the same pipeline unit in the same clock, for example an ADD with 4 stages):

stage1 -> stage2 -> stage3 -> stage4 -> nothing
nothing -> stage1 -> stage2 -> stage3 -> stage4

Superscalar - these two sequences execute in parallel (two instructions can be launched to different pipeline units in the same clock, for example ADD and MUL):

ADD
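A hedged microbenchmark sketch of the distinction: a dependent chain can only benefit from pipelining (at best one result per add latency), while independent chains let a superscalar core issue several adds to different units in the same cycle. An optimizing compiler may fold these loops entirely, so treat this as a sketch of the dependence structure rather than a ready-made measurement:

#include <stdio.h>

#define N 100000000L

int main(void) {
    volatile long seed = 1;
    long x = seed, a = seed, b = seed, c = seed, d = seed;

    /* Dependent chain: each add needs the previous result. */
    for (long i = 0; i < N; i++)
        x = x + 1;

    /* Four independent chains: candidates for same-cycle issue
       on a superscalar core. */
    for (long i = 0; i < N; i++) {
        a += 1; b += 2; c += 3; d += 4;
    }

    printf("%ld %ld\n", x, a + b + c + d);
    return 0;
}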

Why does high-memory not exist for 64-bit cpu?

Submitted by 妖精的绣舞 on 2019-12-07 18:08:15
Question: I am trying to understand the high-memory problem for 32-bit CPUs on Linux. Why is there no high-memory problem for 64-bit CPUs? In particular, how does the division of virtual memory into kernel space and user space change, so that the requirement for high memory doesn't exist on 64-bit CPUs? Thanks.

Answer 1: A 32-bit system can only address 4GB of memory. In Linux this is divided into 3GB of user space and 1GB of kernel space. This 1GB is sometimes not enough, so the kernel might need to map
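The arithmetic behind that split, as a small worked example (the ~896 MiB low-memory figure is the traditional 32-bit x86 default, an assumption about the classic layout rather than a universal constant):

#include <stdio.h>

int main(void) {
    unsigned long long vspace = 1ULL << 32;     /* 4 GiB of 32-bit virtual space */
    unsigned long long user   = 3ULL << 30;     /* 3 GiB user split              */
    unsigned long long kernel = vspace - user;  /* 1 GiB kernel split            */

    /* Classic 32-bit x86 Linux reserves part of that gigabyte for
       vmalloc/fixmap, leaving roughly 896 MiB it can keep permanently
       direct-mapped; RAM beyond that is "high memory". On x86-64, the
       kernel half of a 48-bit address space dwarfs installed RAM, so
       all of it can be direct-mapped and highmem disappears. */
    printf("kernel window: %llu MiB\n", kernel >> 20);
    printf("half of a 48-bit space: %llu TiB\n", (1ULL << 47) >> 40);
    return 0;
}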

What is the maximum possible IPC that can be achieved by the Intel Nehalem microarchitecture?

Submitted by 廉价感情. on 2019-12-07 18:04:51
Question: Is there an estimate of the maximum instructions per cycle achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum instructions per cycle?

Answer 1: TL:DR: Intel Core, Nehalem, and Sandybridge/IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions (up to 2 of these can be micro-fused stores or load+ALU). Haswell up to 9th-gen: a maximum of 6
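To make the 5-instructions-in-4-uops pattern concrete, here is a hedged GCC/Clang inline-assembly sketch for x86-64 (the loop is invented for illustration): three single-uop adds plus a cmp+jne pair that macro-fuses into one uop, so a 4-wide front end can sustain 5 instructions per cycle:

static void five_per_cycle(long n) {
    long a = 0, b = 0, i = 0;
    __asm__ volatile(
        "1:\n\t"
        "add $1, %[a]\n\t"    /* single-uop ALU instruction      */
        "add $2, %[b]\n\t"    /* single-uop ALU instruction      */
        "add $1, %[i]\n\t"    /* loop counter, also a single uop */
        "cmp %[n], %[i]\n\t"  /* cmp+jne macro-fuse into one uop */
        "jne 1b\n\t"
        : [a] "+r"(a), [b] "+r"(b), [i] "+r"(i)
        : [n] "r"(n)
        : "cc");
}

int main(void) {
    five_per_cycle(100000000L);
    return 0;
}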

About the RIDL vulnerabilities and the “replaying” of loads

Submitted by 拜拜、爱过 on 2019-12-07 17:46:58
Question: I'm trying to understand the RIDL class of vulnerabilities: a class of vulnerabilities that is able to read stale data from various micro-architectural buffers. Today the known vulnerabilities exploit the LFBs, the load ports, the eMC and the store buffer. The paper linked is mainly focused on LFBs. I don't understand why the CPU would satisfy a load with the stale data in an LFB. I can imagine that if a load misses in L1d it is internally "replayed" until the L1d brings data into an

Why do L1 and L2 Cache waste space saving the same data?

Submitted by 蓝咒 on 2019-12-07 16:33:02
Question: I don't know why the L1 cache and L2 cache save the same data. For example, let's say we want to access Memory[x] for the first time. Memory[x] is first brought into the L2 cache, then the same piece of data is brought into the L1 cache, from which the CPU registers can retrieve it. But now we have duplicate data stored in both the L1 and L2 cache. Isn't that a problem, or at least a waste of storage space?

Answer 1: I edited your question to ask about why CPUs waste cache space storing the same data in multiple levels of cache
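A small worked example of what the duplication costs under an inclusive hierarchy versus an exclusive one (the sizes are illustrative, not tied to any particular CPU):

#include <stdio.h>

int main(void) {
    int l1 = 32, l2 = 256;   /* KiB, typical per-core sizes (illustrative) */

    /* Inclusive hierarchy: every L1 line also has a copy in L2, so the
       distinct data held near the core is bounded by L2 alone.
       Exclusive hierarchy: a line lives in exactly one level, so the
       capacities add. */
    printf("inclusive: %d KiB of distinct data\n", l2);
    printf("exclusive: %d KiB of distinct data\n", l1 + l2);
    return 0;
}

The inclusive design gives up that extra 32 KiB partly so the outer cache can answer coherence snoops on behalf of the inner one, which is one reason designers accept the duplication.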