cpu-architecture | 易学教程

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

Question: I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake architecture), with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line. for(int i=0; i < 1000000; i++){ array[i]

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Question: The Intel Intrinsics Guide states that the "sqrtsd" instruction has a latency of 18 cycles. I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x number), the latency is only 13 cycles. Why is that? One theory I had is that since 13 is the latency of "sqrtss", which is the same as "sqrtsd" but operates on 32-bit floating-point values, maybe the processor was smart enough to understand that 256 can fit in 32 bits and

How to find the Number of physical CPU Cores (not logical SMT hyperthreads) via .NET Core?

Question: I want to detect the number of real physical cores, not logical cores, for workloads that scale negatively when more threads compete for private per-core caches, and/or have high enough IPC that running more than one logical thread per core doesn't increase throughput by more than the increase in threading overhead, especially for problems that don't scale perfectly to lots of cores. Or to put it another way, the number of threads that can run without any of them competing for execution

What's the size of a QWORD on a 64-bit machine?

Question: I'm currently looking for an answer to the above question. So far I have found people saying that the word size refers to the size of a processor register, which would suggest that on a 64-bit machine the word size is 64 bits and thus a QWORD (4 * word) is 256 bits in size. On the other hand, I found sources like this saying the size would be 128 bits (64 bits for 32-bit, and double that for 64-bit), while still others suggest the size would be 64 bits. But the last one is somehow

Intel's CLWB instruction invalidating cache lines

Question: I am trying to find a configuration or memory access pattern for Intel's clwb instruction that would not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and directly mapping this char device into the address space. I also tried adding this non-volatile memory as a new NUMA node and using the numactl --membind command to bind memory to it. In both cases, when I use clwb on a cached address, the line is evicted. I

Why flush the pipeline for Memory Order Violation caused by other logical processors?

Question: The Memory Order Machine Clear performance event is described by the VTune documentation as: "The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired." However, I don't see why that should be the case. There is no synchronisation order between loads and stores on different logical processors. The processor could

What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core?

Question: Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote). What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously),

Are load ops deallocated from the RS when they dispatch, complete or some other time?

Question: On modern Intel [1] x86, are load uops freed from the RS (Reservation Station) at the point they dispatch [2], or when they complete [3], or somewhere in-between [4]?
[1] I am also interested in AMD Zen and sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel, which may make investigating this on AMD a separate task.
[2] Dispatch here means leave the RS for execution.
