cpu-architecture

Out-of-order execution vs. speculative execution

雨燕双飞 submitted on 2019-12-17 09:49:21
Question: I have read the Wikipedia pages about out-of-order execution and speculative execution. What I fail to understand, though, are the similarities and differences. It seems to me that speculative execution uses out-of-order execution when, for example, it has not yet determined the value of a condition. The confusion came when I read the Meltdown and Spectre papers and did additional research. It is stated in the Meltdown paper that Meltdown is based on out-of-order execution, while some other

Difference between core and processor

筅森魡賤 submitted on 2019-12-17 06:19:17
Question: What is the difference between a core and a processor? I've already searched Google, but I only find definitions of multi-core and multi-processor, which is not what I am looking for. Answer 1: A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations

What is the stack engine in the Sandybridge microarchitecture?

Deadly submitted on 2019-12-17 04:34:34
Question: I am reading http://www.realworldtech.com/sandy-bridge/ and I'm having trouble understanding a few points: The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops. What is a dedicated stack pointer tracker actually? For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers the status array for in-flight uops

Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

删除回忆录丶 submitted on 2019-12-17 02:51:55
Question: I've been racking my brain for a week trying to complete this assignment and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions: Your assignment is the opposite of our first lab assignment, which was to optimize a prime number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the

Is performance reduced when executing loops whose uop count is not a multiple of processor width?

▼魔方 西西 submitted on 2019-12-16 20:57:49
Question: I'm wondering how loops of various sizes perform on recent x86 processors, as a function of the number of uops. Here's a quote from Peter Cordes, who raised the issue of non-multiple-of-4 counts in another question: I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. The issue is
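The quoted observation can be put into numbers with a small model. The sketch below assumes a front end that issues at most `width` uops per cycle and, as in the quote, never lets an issue group span two loop iterations (abc, abc, ... rather than abca, bcab, ...). The function names are hypothetical, and the model only illustrates the quoted claim; it is not a measured description of any particular CPU.

```c
#include <assert.h>

/* Cycles per iteration when issue groups may NOT span iterations
   (the "abc, abc, ..." pattern): ceil(uops / width). */
unsigned cycles_no_wrap(unsigned uops, unsigned width)
{
    assert(width > 0);
    return (uops + width - 1) / width;
}

/* Effective uops issued per cycle under the same assumption. */
double uops_per_cycle_no_wrap(unsigned uops, unsigned width)
{
    assert(uops > 0 && width > 0);
    return (double)uops / cycles_no_wrap(uops, width);
}
```

Under this model, a 5-uop loop on a 4-wide front end takes 2 cycles per iteration (2.5 uops per cycle), whereas an 8-uop loop sustains the full 4 uops per cycle.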

Direct Arithmetic Operations on Small-sized Numbers in RISC Architectures

徘徊边缘 submitted on 2019-12-14 03:58:49
Question: Are there any RISC architectures which allow arithmetic operations to be applied individually to bytes, half-words and other data cells whose size is less than the size of the CPU's general-purpose registers? In Intel x86 (IA-32) and x86-64 (known as EM64T or AMD64) processors, not only is the whole register available, but its smaller parts are operable as well. The Intel ISA allows all the arithmetic operations to be performed on the whole register, its half, its quarter and a byte (to be more precise,

How does the VIPT to PIPT conversion work on L1->L2 eviction

给你一囗甜甜゛ submitted on 2019-12-14 03:55:02
Question: This scenario came into my head and it seems a bit basic, but I'll ask. There is a virtual index and a physical tag in L1, but the set becomes full, so a line is evicted. How does the L1 controller get the full physical address from the virtual index and the physical tag in L1 so the line can be inserted into L2? I suppose it could search the TLB for the combination, but that seems slow, and the translation may not be in the TLB at all. Perhaps the full physical address from the original TLB translation is
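One common way to reason about this question: in a VIPT L1 whose index and line-offset bits all lie within the page offset, those bits are identical in the virtual and the physical address, so the stored physical tag plus the set index already identify the full physical line address. The sketch below assumes a hypothetical geometry (4 KiB pages, 64-byte lines, 64 sets) and a tag that holds every physical-address bit above the page offset; the function names are illustrative, and real designs may differ.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_BITS  = 6,   /* 64-byte lines (assumed) */
    INDEX_BITS = 6,   /* 64 sets (assumed) */
    PAGE_BITS  = 12   /* 4 KiB pages (assumed) */
};

/* Physical tag: all physical-address bits above the page offset. */
uint64_t l1_tag(uint64_t paddr)
{
    return paddr >> PAGE_BITS;
}

/* Set index taken from the virtual address, bits [11:6]. */
unsigned l1_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> LINE_BITS) & ((1u << INDEX_BITS) - 1u));
}

/* Rebuild the line-aligned physical address from what L1 holds. */
uint64_t l1_line_paddr(uint64_t tag, unsigned index)
{
    /* Only valid when the index bits fall inside the page offset. */
    assert(LINE_BITS + INDEX_BITS <= PAGE_BITS);
    assert(index < (1u << INDEX_BITS));
    return (tag << PAGE_BITS) | ((uint64_t)index << LINE_BITS);
}
```

If the index used bits above the page offset, this reconstruction would no longer hold, and the cache would need to keep more physical-address bits alongside each line.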

What are the costs of failed store-to-load forwarding on x86?

我怕爱的太早我们不能终老 submitted on 2019-12-14 01:38:37
Question: What are the costs of a failed store-to-load forwarding on recent x86 architectures? In particular, store-to-load forwarding that fails because the load partly overlaps an earlier store, or because the earlier load or store crosses some alignment boundary that causes the forwarding to fail. Certainly there is a latency cost: how big is it? Is there also a throughput cost, e.g., does a failed store-to-load forwarding use additional resources that are then unavailable to other loads and stores,

Can AVX2-compiled program still use 32 registers of an AVX-512 capable CPU?

陌路散爱 submitted on 2019-12-14 01:28:09
Question: Assuming AVX2-targeted compilation with C++ intrinsics, if I write an n-body algorithm using 17 registers per body-body computation, can the 17th register be indirectly (register-renaming hardware) or directly (Visual Studio compiler, GCC) mapped onto an AVX-512 register to cut off the memory dependency? For example, the Skylake architecture has 1 or 2 AVX-512 FMA units. Does this number change the total registers available too? (Specifically, on a Xeon Silver 4114 CPU.) If this works, how does it work?

Why is it faster to print a number in binary using arithmetic instead of _bittest?

橙三吉。 submitted on 2019-12-13 08:13:06
Question: The purpose of the next two code sections is to print a number in binary. The first one does this with two instructions (_bittest), while the second does it with pure arithmetic instructions, three in total. The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++) {
        for (nBit = 0; nBit < 31; nBit++) {
            bits
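For reference, the arithmetic approach the question describes can be sketched in portable C as a shift followed by a mask. This is not the asker's code, which is cut off above; `fill_bits` and `bits_to_string` are hypothetical helper names, and the sketch deliberately avoids the MSVC-specific `_bittest` intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store bit n of value in bits[n] (0 or 1) for n = 0 .. count-1. */
void fill_bits(uint32_t value, unsigned char *bits, size_t count)
{
    assert(count <= 32);
    for (size_t n = 0; n < count; n++)
        bits[n] = (unsigned char)((value >> n) & 1u);  /* shift, then mask */
}

/* Render the bits most-significant first; out needs count + 1 bytes. */
void bits_to_string(const unsigned char *bits, size_t count, char *out)
{
    for (size_t n = 0; n < count; n++)
        out[n] = bits[count - 1 - n] ? '1' : '0';
    out[count] = '\0';
}
```

Calling `fill_bits(78002, bits, 32)` followed by `bits_to_string(bits, 32, text)` would leave the 32-character binary representation of the question's test value in `text`.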