cpu-architecture

Is LFENCE serializing on AMD processors?

别说谁变了你拦得住时间么 submitted on 2019-11-26 08:37:45
Question: In recent Intel ISA documents, the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the instruction includes this line: "Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes." Note that this applies to all instructions, not just memory load instructions, making lfence more than just a

Does lock xchg have the same behavior as mfence?

走远了吗. submitted on 2019-11-26 08:35:12
Question: What I'm wondering is whether lock xchg behaves similarly to mfence from the perspective of one thread accessing a memory location that is being mutated (let's just say at random) by other threads. Does it guarantee that I get the most up-to-date value? Does the same hold for the memory read/write instructions that follow it? The reason for my confusion is: 8.2.2 "Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions." -Intel 64 Developers Manual Vol. 3 Does

Why is a conditional move not vulnerable for Branch Prediction Failure?

大兔子大兔子 submitted on 2019-11-26 07:57:20
Question: After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable to branch prediction failure. I found an article on conditional moves here (PDF by AMD). There, too, they claim a performance advantage for conditional moves. But why is this? I don't see it. At the moment that ASM instruction is evaluated, the result of the preceding CMP instruction is not known yet. Answer 1: Mis-predicted branches are expensive. A modern
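To make the distinction in this question concrete, here is a minimal sketch in Python (not taken from the original answer). It contrasts a control dependency, where the hardware has to guess which path runs, with a data dependency, where the condition is just another input to a computation, which is how a conditional move consumes the flags from CMP. The function names are hypothetical, and the Python interpreter itself still branches internally, so this only illustrates the shape of the dataflow, not real performance.

```python
def select_branchy(cond: bool, a: int, b: int) -> int:
    """Control dependency: which statement executes depends on cond."""
    if cond:
        return a
    return b


def select_branchless(cond: bool, a: int, b: int) -> int:
    """Data dependency: a, b and cond are all plain inputs, nothing is guessed."""
    mask = -int(bool(cond))          # -1 (all ones) if cond is true, else 0
    return (a & mask) | (b & ~mask)  # keeps a when mask is all ones, else b


if __name__ == "__main__":
    for cond in (True, False):
        print(cond, select_branchy(cond, 7, 9), select_branchless(cond, 7, 9))
```

In the branchless form there is nothing to predict: the result simply waits until all three inputs are available, in the same way a cmov waits for the flags produced by the preceding CMP.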

What happens after a L2 TLB miss?

送分小仙女□ submitted on 2019-11-26 07:34:13
Question: I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer both miss. I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory. Answer 1: (Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.) Modern x86
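As background for what a page walk has to do, the following sketch in Python splits a virtual address into the per-level table indices. It is not from the original answer and assumes x86-64 4-level paging with 4 KiB pages (four 9-bit indices plus a 12-bit page offset). The function and key names are hypothetical.

```python
def split_virtual_address(va: int) -> dict:
    """Split a 48-bit x86-64 virtual address into page-walk indices.

    Assumes 4-level paging with 4 KiB pages: four 9-bit table indices
    followed by a 12-bit offset into the page.
    """
    return {
        "pml4": (va >> 39) & 0x1FF,  # bits 47..39: index into the top-level table
        "pdpt": (va >> 30) & 0x1FF,  # bits 38..30
        "pd": (va >> 21) & 0x1FF,    # bits 29..21
        "pt": (va >> 12) & 0x1FF,    # bits 20..12: index into the last-level table
        "offset": va & 0xFFF,        # bits 11..0: byte offset within the 4 KiB page
    }


if __name__ == "__main__":
    print(split_virtual_address(0x00007F1234567ABC))
```

Each index selects one entry in one table, so a full walk under these assumptions is up to four dependent memory reads, one per level.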

Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC

北城余情 submitted on 2019-11-26 07:31:40
Question: On recent CPUs (at least the last decade or so) Intel has offered three fixed-function hardware performance counters, in addition to various configurable performance counters. The three fixed counters are: INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD, and CPU_CLK_UNHALTED.REF_TSC. The first counts retired instructions, the second counts actual cycles, and the last is the one that interests us. The description in Volume 3 of the Intel Software Developer's Manual is: This event counts the number of reference

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

我的梦境 submitted on 2019-11-26 07:28:19
Question: I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how best to do this in code, and I also want to know how it is done internally in the CPU, given the superscalar architecture. Let's say I want to do a long sum such as the following in SSE: //sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication) sum = _mm_set1_ps(0.0f
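One property worth keeping in mind when reading this question is that a fused multiply-add computes a*b + c with a single rounding, whereas a separate multiply followed by an add rounds twice. The sketch below models that in Python with exact rational arithmetic; it is an illustration of the rounding behaviour only, not the SSE/AVX intrinsics the question asks about, and `fused_multiply_add` is a hypothetical helper name.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model of FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double.
    print(unfused_multiply_add(a, b, c))
    print(fused_multiply_add(a, b, c))
```

With these inputs the unfused version loses the small term entirely because the intermediate product rounds to 1.0, while the fused model keeps it.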

Why is a boolean 1 byte and not 1 bit of size?

纵饮孤独 submitted on 2019-11-26 07:18:27
Question: In C++, why is a boolean 1 byte and not 1 bit in size? Why aren't there types like 4-bit or 2-bit integers? I miss these when writing an emulator for a CPU. Answer 1: Because the CPU can't address anything smaller than a byte. Answer 2: From Wikipedia: Historically, a byte was the number of bits used to encode a single character of text in a computer and it is for this reason the basic addressable element in many computer architectures. So the byte is the basic addressable unit ,
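Since the byte is the smallest addressable unit, sub-byte values are usually stored by packing them into a byte and extracting them with shifts and masks. The sketch below shows that idea in Python for two 4-bit fields and for single flag bits, as an emulator might need. It is an illustrative assumption about one common approach, not something from the answers, and all function names are hypothetical.

```python
def pack_nibbles(hi: int, lo: int) -> int:
    """Store two 4-bit values in one byte (hi in bits 7..4, lo in bits 3..0)."""
    if not (0 <= hi <= 0xF and 0 <= lo <= 0xF):
        raise ValueError("each nibble must be in the range 0..15")
    return (hi << 4) | lo


def unpack_nibbles(byte: int) -> tuple:
    """Recover the two 4-bit values from a packed byte."""
    return (byte >> 4) & 0xF, byte & 0xF


def get_bit(byte: int, n: int) -> int:
    """Read bit n (0 = least significant) of a byte."""
    return (byte >> n) & 1


def set_bit(byte: int, n: int, value: bool) -> int:
    """Return a copy of byte with bit n set or cleared."""
    if value:
        return (byte | (1 << n)) & 0xFF
    return byte & ~(1 << n) & 0xFF


if __name__ == "__main__":
    packed = pack_nibbles(0xA, 0x3)
    print(hex(packed), unpack_nibbles(packed))
```

Every access to a packed field costs extra shift and mask operations, which is the trade-off against using one whole byte per value.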

Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths

时光总嘲笑我的痴心妄想 submitted on 2019-11-26 06:48:22
Question: I was playing with the code in this answer, slightly modifying it:

    BITS 64
    GLOBAL _start
    SECTION .text
    _start:
        mov ecx, 1000000
    .loop:
        ;T is a symbol defined with the CLI (-DT=...)
        TIMES T imul eax, eax
        lfence
        TIMES T imul edx, edx
        dec ecx
        jnz .loop
        mov eax, 60 ;sys_exit
        xor edi, edi
        syscall

Without the lfence, the results I get are consistent with the static analysis in that answer. When I introduce a single lfence I'd expect the CPU to execute the imul edx, edx sequence of the k-th

How are x86 uops scheduled, exactly?

自古美人都是妖i submitted on 2019-11-26 06:02:02
Question: Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions. For example, take the following toy loop:

    top:
        lea eax, [ecx + 5]
        popcnt eax, eax
        add edi, eax
        dec ecx
        jnz top

this basically implements the loop (with the following

How can I determine for which platform an executable is compiled?

半腔热情 submitted on 2019-11-26 04:42:17
Question: I need to work with Windows executables which are made for x86, x64, and IA64. I'd like to programmatically figure out the platform by examining the files themselves. My target language is PowerShell but a C# example will do. Failing either of those, if you know the logic required that would be great. Answer 1: (from another Q, since removed) Machine type: This is a quick little bit of code I based on some that gets the linker timestamp. This is in the same header, and it seems to work -
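The "logic required" can be sketched as follows. This is a Python illustration rather than the PowerShell or C# code from the answer, which is not shown in this excerpt. It relies on the documented PE layout: the file starts with "MZ", the 32-bit offset of the PE header is stored at 0x3C, the header begins with "PE\0\0", and the 16-bit Machine field follows the signature. The function name and the short names in the lookup table are hypothetical.

```python
import struct

# Machine field values from the PE/COFF format for the three platforms asked about.
MACHINE_NAMES = {
    0x014C: "x86",   # IMAGE_FILE_MACHINE_I386
    0x8664: "x64",   # IMAGE_FILE_MACHINE_AMD64
    0x0200: "IA64",  # IMAGE_FILE_MACHINE_IA64
}


def pe_machine(data: bytes) -> str:
    """Return the target platform of a PE image given its leading bytes."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew: little-endian 32-bit file offset of the PE header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 6 or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The Machine field is the first 16-bit value after the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, "unknown (0x%04X)" % machine)
```

To use it, read the file in binary mode and pass the bytes in, for example `pe_machine(open(path, "rb").read())`, where `path` is whichever executable is being inspected.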