cpu-architecture

Do FP and integer division compete for the same throughput resources on x86 CPUs?

那年仲夏 submitted on 2020-08-04 05:43:21
Question: We know that Intel CPUs do integer division and FP div/sqrt on a not-fully-pipelined divide execution unit on port 0. We know this from IACA output, other published material, and experimental testing (e.g. https://agner.org/optimize/). But are there independent dividers for FP and integer (competing only for dispatch via port 0), or does interleaving two div-throughput-bound workloads make their costs add nearly linearly, if one is integer and the other is FP? This is complicated by Intel CPUs …

Fastest Offset Read for a Small Array

ぃ、小莉子 submitted on 2020-08-03 05:48:43
Question: For speed, I would like to read one of 8 registers, selected by the value in a 9th register. The fastest way I can see to do this is to use 3 conditional jumps (checking 3 bits in the 9th register). This should have shorter latency than the standard way of doing it with an offset memory read, but it still requires at least 6 clock cycles (at least one test plus one conditional jmp per bit checked). Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register …
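For comparison, here is what the two approaches look like in C++ (function names mine). On typical modern x86, spilling the 8 values to a small array and doing one indexed load is usually the faster option: an L1-hitting load is around 4-5 cycles of latency, versus a 3-deep tree of test+branch pairs that also risks mispredicts when the index varies:

```cpp
#include <cassert>
#include <cstdint>

// Indexed approach: one L1-cache load, no branches.
uint64_t select_indexed(const uint64_t vals[8], unsigned idx) {
    return vals[idx & 7];
}

// Branch-tree approach from the question: test one bit per level.
uint64_t select_branchy(const uint64_t vals[8], unsigned idx) {
    if (idx & 4) {
        if (idx & 2) return (idx & 1) ? vals[7] : vals[6];
        else         return (idx & 1) ? vals[5] : vals[4];
    } else {
        if (idx & 2) return (idx & 1) ? vals[3] : vals[2];
        else         return (idx & 1) ? vals[1] : vals[0];
    }
}
```

Both functions return the same value for every index; only the latency and branch-predictor behavior differ.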

What is instruction fusion in contemporary x86 processors?

孤街浪徒 submitted on 2020-08-02 08:40:06
Question: As I understand it, there are two types of instruction fusion: micro-operation fusion and macro-operation fusion. Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction". If several instructions are fused, we obtain a macro-operation. If several macro-operations are fused, we obtain macro-operation fusing. Am I correct? Answer 1: No, fusion is totally separate from how one complex instruction (like cpuid or lock …
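The two terms are narrower than the question's ladder suggests: micro-fusion lets one instruction (e.g. a memory-source ALU instruction) travel through most of the pipeline as a single micro-op, while macro-fusion merges two adjacent instructions (a cmp/test plus the following conditional jump) into one micro-op at decode. A hedged sketch of C++ whose hot loop commonly compiles to exactly those fusible patterns (the assembly in the comment is what compilers like GCC/Clang typically emit, not guaranteed output):

```cpp
#include <cassert>

// A simple reduction loop. A typical scalar compilation is:
//     add  eax, [rdi + rcx*4]  ; micro-fusion: load + add as one fused uop
//     inc  rcx
//     cmp  rcx, rsi
//     jne  .loop               ; macro-fusion: cmp + jne decode to a single uop
int sum(const int* p, long n) {
    int s = 0;
    for (long i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

So fusion is about how the front end packages work into micro-ops, not about building ever-larger "macro" units out of smaller ones.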

Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

落花浮王杯 submitted on 2020-07-28 06:25:11
Question: While experimenting with measuring the execution time of arithmetic operations, I came across some very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block with two arithmetic operations in the for loop body. Here is the code I ended up testing: #include <iostream> #include <chrono> #define NUM_ITERATIONS 100000000 int main() { // Block 1: one operation in loop body { int64_t x = 0, y = 0; …

Calculate memory accesses

拈花ヽ惹草 submitted on 2020-07-22 06:39:27
Question: xor dword [0x301a80], 0x12345 — how many memory accesses does this take, given that the opcode and addressing mode are 2 bytes? If I understand correctly, even though the immediate is 0x12345, it still actually takes 4 bytes, and we can't pack it together with 0x301a80, right? So we have: 2 + 4 + 4 bytes (and not 2 + 3.5 + 2.5 = 8), which is 4 memory accesses. Is my thinking right? Answer 1: The total instruction size is 10 bytes (in 32-bit mode). That takes probably 0 to 2 I-cache accesses on a modern x86 to fetch in aligned 16-byte …
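The key encoding fact is that x86 immediates only come in whole-byte widths: a sign-extended imm8 or a full imm32; there is no 2.5- or 3.5-byte field. Since 0x12345 does not fit in a sign-extended byte, the assembler must emit a 4-byte immediate, and the absolute address needs a 4-byte disp32, giving 2 + 4 + 4 = 10 bytes. A sketch of that arithmetic (the 2-byte opcode+ModRM figure is taken from the question's premise):

```cpp
#include <cassert>
#include <cstdint>

// An imm8 is sign-extended, so it can only encode -128..127.
bool fits_imm8(int32_t v) { return v >= -128 && v <= 127; }

// Length of `xor dword [abs32], imm` in 32-bit mode:
// opcode + ModRM, then a 4-byte absolute displacement, then the immediate.
int encoded_length_xor_abs32_imm(int32_t imm) {
    int opcode_modrm = 2;                      // per the question's premise
    int disp32 = 4;                            // absolute address like 0x301a80
    int imm_bytes = fits_imm8(imm) ? 1 : 4;    // whole bytes only, never 3.5
    return opcode_modrm + disp32 + imm_bytes;
}
```

So the 10-byte total in the answer falls out directly: the 0x12345 immediate forces the imm32 form.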

Confused about Intel Optane DC SSD usage as extra RAM with IMDT? [closed]

旧时模样 submitted on 2020-07-15 09:46:06
Question: Closed. This question does not meet Stack Overflow guidelines and is not currently accepting answers. I'm a little confused about Intel Optane DC. I want my Optane DC to be able to perform as both DRAM and storage. On the one hand, I understood that only the "Intel Optane DC Persistent Memory DIMM" is able to perform as DRAM. That's because it has 2 modes …

x86 Segmented Memory

£可爱£侵袭症+ submitted on 2020-07-10 05:44:50
Question: While reading The Art of Assembly the other day, I came to the section on memory layout. It started discussing segmented memory, and I didn't think it made a lot of sense. Splitting memory into segments makes perfect sense as a way of organization, but with the function segment + offset, what do you do when the function repeats its outputs? e.g. 1038 + 57, 57 + 1038, and 1095 + 0 all come out to the linear address 1095. Isn't that a bad thing? Wouldn't you accidentally address the …

How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?

≯℡__Kan透↙ submitted on 2020-07-09 08:44:40
Question: The reason this confuses me is that every address just holds a sequence of 1's and 0's. So how does the CPU differentiate, let's say, 00000100 (an integer) from 00000100 (a CPU instruction)? Answer 1: First of all, different commands have different values (opcodes). That's how the CPU knows what to do. Still, the question remains: what's a command, and what's data? Modern PCs use the von Neumann architecture (https://en.wikipedia.org/wiki/John_von_Neumann), where data and opcodes are stored …

Why can't we move a 64-bit immediate value to memory?

泪湿孤枕 submitted on 2020-07-09 05:15:53
Question: First, I am a little confused about the differences between movq and movabsq; my textbook says: "The regular movq instruction can only have immediate source operands that can be represented as 32-bit two's-complement numbers. This value is then sign-extended to produce the 64-bit value for the destination. The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination." I have two questions about this. Question 1: The …