intel

Is clflush or clflushopt atomic when the system crashes?

有些话、适合烂在心里 submitted on 2021-01-27 05:27:33
Question: Commonly a cache line is 64 B, but the atomicity of non-volatile memory is 8 B. For example: x[1] = 100; x[2] = 100; clflush(x); x is cache-line aligned and initially set to 0. The system crashes during clflush(). Is it possible that x[1] == 0 and x[2] == 100 after reboot? Answer 1: Under the following assumptions: I assume that the code you've shown represents a sequence of x86 assembly instructions rather than actual C code that is yet to be compiled. I also assume that the code is being executed on a Cascade Lake processor

Why doesn't Ice Lake have MOVDIRx like Tremont? Do they already have better ones?

杀马特。学长 韩版系。学妹 submitted on 2021-01-27 04:46:49
Question: I notice that Intel Tremont has 64-byte store instructions, MOVDIRI and MOVDIR64B. These guarantee an atomic write to memory but do not guarantee load atomicity. Moreover, the write is weakly ordered, so a fence may be needed immediately afterwards. I can find no MOVDIRx in Ice Lake. Why doesn't Ice Lake need instructions like MOVDIRx? (At the bottom of page 15) Intel® Architecture Instruction Set Extensions and Future Features Programming Reference https://software.intel.com/sites

How does loop address alignment affect the speed on Intel x86_64?

一世执手 submitted on 2021-01-27 04:13:29
Question: I'm seeing a 15% performance degradation of the same C++ code compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster than when it is at 0x415250. I'm running this on an Intel Core 2 Duo. I use gcc 4.4.5 on x86_64 Ubuntu. Can anybody explain the cause of the slowdown and how I can force gcc to align the loop optimally? Here is the disassembly for both cases with profiler annotation: 415220 576 12.56%

How does Intel x86 implement a total order over stores

微笑、不失礼 submitted on 2021-01-05 09:16:06
Question: x86 guarantees a total order over all stores due to its TSO memory model. My question is whether anyone has an idea how this is actually implemented. I have a good impression of how all 4 fences are implemented, so I can explain how local order is preserved. But the 4 fences alone just give program order; they won't give you TSO (I know TSO allows older stores to jump ahead of newer loads, so only 3 out of 4 fences are needed). Total order over all memory actions on a single address is the responsibility of

How should I approach finding the number of pipeline stages in my laptop's CPU

浪尽此生 submitted on 2020-12-23 08:20:25
Question: I want to look into how the latest processors differ from a standard RISC-V implementation (RISC-V has a classic 5-stage pipeline: fetch, decode, execute (ALU), memory, write-back), but I am not able to find how I should start approaching the problem so as to discover the current pipelining implementation of a processor. I tried referring to the Intel documentation for the i7-4510U but it was not much help. Answer 1: Haswell's pipeline length is reportedly 14 stages (on a uop-cache hit), 19 stages when fetching from L1i for

Where does data go after eviction from a cache set in the case of Intel Core i3/i7

柔情痞子 submitted on 2020-12-05 12:29:05
Question: The L1/L2 caches are inclusive in Intel processors, and the L1/L2 caches are 8-way associative, meaning each set holds 8 different cache lines. Cache lines are operated on as a whole: if I want to remove a few bytes from a cache line, the whole cache line is removed, not only the bytes I want to remove. Am I right? Now, my question is: whenever a cache line of a set is removed/evicted from the cache, either by some other process or by using clflush (manual eviction of a cache line

About Adaptive Mode for L1 Cache in Hyper-threading

混江龙づ霸主 submitted on 2020-12-03 04:09:51
Question: I'm a student doing some research on Hyper-Threading. I'm a little confused about the feature L1 Data Cache Context Mode. The architecture optimization manual describes the L1 cache as operating in two modes: The first-level cache can operate in two modes depending on a context-ID bit: Shared mode: the L1 data cache is fully shared by two logical processors. Adaptive mode: in adaptive mode, memory accesses using the page directory are mapped identically across logical
