cpu-architecture

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

孤人 submitted on 2020-04-13 08:06:08
Question: I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake architecture) with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line. for(int i=0; i < 1000000; i++){ array[i]
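As a rough sanity check on what count one might expect, the sketch below works out how many 64-byte cache lines the loop covers. It assumes (the excerpt does not say this) that each iteration writes its element once, that the array is contiguous and line-aligned, and that every dirty line is eventually written back to memory exactly once. The function names and the per-channel summing helper are illustrative only.

```python
CACHE_LINE_BYTES = 64      # element size given in the question
NUM_ELEMENTS = 1_000_000   # loop trip count given in the question


def expected_line_writes(num_elements, element_bytes, line_bytes=CACHE_LINE_BYTES):
    """Number of cache lines covered by a contiguous, line-aligned array.

    Assumption: each covered line is written back to memory exactly once.
    """
    total_bytes = num_elements * element_bytes
    # Round up so a partially covered final line still counts.
    return (total_bytes + line_bytes - 1) // line_bytes


def total_inserts(per_channel_counts):
    """Sum hypothetical per-channel counter readings (e.g. imc0..imc5)."""
    return sum(per_channel_counts)


if __name__ == "__main__":
    lines = expected_line_writes(NUM_ELEMENTS, CACHE_LINE_BYTES)
    print(f"array bytes: {NUM_ELEMENTS * CACHE_LINE_BYTES}, cache lines: {lines}")
```

Under these assumptions, a summed reading far below the computed line count would be the "unexpected low result" the title refers to.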

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

三世轮回 submitted on 2020-04-12 16:10:34
Question: The Intel Intrinsics Guide states that the sqrtsd instruction has a latency of 18 cycles. I tested it with my own program and this is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x number), the latency is only 13. Why is that? One theory I had is that since 13 is the latency of sqrtss, which is the same as sqrtsd but operates on 32-bit floating-point values, maybe the processor is smart enough to understand that 256 can fit in 32 bits and
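The property the asker's theory relies on can be checked directly. The sketch below tests whether a double survives a round trip through IEEE 754 single precision and extracts its 52 explicit fraction bits. The helper names are made up for illustration. It only shows how the two inputs differ in their bit patterns and does not establish why the hardware latency differs.

```python
import struct


def fits_in_float32(x):
    """True if the double x survives a round trip through IEEE 754 binary32."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0] == x
    except OverflowError:
        # Magnitude too large for binary32.
        return False


def double_fraction_bits(x):
    """Return the 52 explicit fraction (mantissa) bits of the double x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits & ((1 << 52) - 1)


if __name__ == "__main__":
    for value in (0.15, 256.0):
        print(value, fits_in_float32(value), hex(double_fraction_bits(value)))
```

A power of two such as 256 has an all-zero fraction field, while 0.15 needs every available fraction bit and is not exactly representable in either format.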

How to find the Number of physical CPU Cores (not logical SMT hyperthreads) via .NET Core?

只谈情不闲聊 submitted on 2020-03-22 08:48:25
Question: I want to detect the number of real physical cores, not logical cores, for workloads that scale negatively when more threads compete for private per-core caches, and/or have high enough IPC that running more than one logical thread per core doesn't increase throughput by more than the increase in threading overhead, especially for problems that don't scale perfectly to lots of cores. Or to put it another way, the number of threads that can run without any of them competing for execution
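The underlying idea can be sketched outside .NET: SMT siblings share the same (package, core) pair, so counting unique pairs gives the physical core count. The Python sketch below assumes a Linux system that exposes the sysfs topology files `physical_package_id` and `core_id`. It is not a .NET Core API, and the function names are illustrative.

```python
import glob
import os


def count_physical_cores(topology):
    """Count unique (package_id, core_id) pairs; SMT siblings share one pair."""
    return len(set(topology))


def read_linux_topology(base="/sys/devices/system/cpu"):
    """Return (package_id, core_id) for each logical CPU that exposes topology.

    Returns an empty list when the sysfs files are not present.
    """
    pairs = []
    for topo in glob.glob(os.path.join(base, "cpu[0-9]*", "topology")):
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                package_id = int(f.read())
            with open(os.path.join(topo, "core_id")) as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue  # offline CPU or unexpected file contents
        pairs.append((package_id, core_id))
    return pairs


if __name__ == "__main__":
    pairs = read_linux_topology()
    print("logical CPUs:", os.cpu_count())
    print("physical cores:", count_physical_cores(pairs) if pairs else "unknown")
```

For example, a single-socket machine with two cores and two hyperthreads per core yields the pairs (0,0), (0,0), (0,1), (0,1), which collapse to two physical cores.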

What's the size of a QWORD on a 64-bit machine?

让人想犯罪 __ submitted on 2020-03-15 05:41:08
Question: I'm currently looking for an answer to the above question. So far I have found people saying that the word size refers to the size of a processor register, which would suggest that on a 64-bit machine the word size is 64 bits and thus a QWORD (4 * word) is 256 bits in size. On the other hand, I found sources like this saying the size would be 128 bits (64 bits for 32-bit, doubled for 64-bit), while still others suggest the size would be 64 bits. But the last one is somehow
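The arithmetic behind the competing claims can be written out. The sketch below assumes the x86 assembly and Windows API naming convention, in which a WORD is fixed at 16 bits for historical reasons, regardless of register width. Under that assumption a QWORD is 4 * 16 = 64 bits. The table name is illustrative.

```python
import struct

# Assumption: x86 / Windows API convention, where WORD is fixed at 16 bits
# and does not track the register width of the machine.
WORD_BITS = 16

SIZES_BITS = {
    "BYTE": 8,
    "WORD": WORD_BITS,
    "DWORD": 2 * WORD_BITS,   # double word
    "QWORD": 4 * WORD_BITS,   # quad word
}

# Cross-check against the standard-size 8-byte unsigned integer format "Q".
QWORD_MATCHES_STRUCT_Q = struct.calcsize("<Q") * 8 == SIZES_BITS["QWORD"]

if __name__ == "__main__":
    for name, bits in SIZES_BITS.items():
        print(f"{name}: {bits} bits")
```

The 256-bit figure in the question comes from substituting a 64-bit "word" into the same 4 * word formula.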

Intel's CLWB instruction invalidating cache lines

允我心安 submitted on 2020-03-09 05:34:40
Question: I am trying to find a configuration or memory access pattern for Intel's clwb instruction that would not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and directly mapping this char device into the address space. I also tried adding this non-volatile memory as a new NUMA node and using the numactl --membind command to bind memory to it. In both cases, when I use clwb on a cached address, the line is evicted. I

Why flush the pipeline for Memory Order Violation caused by other logical processors?

ぐ巨炮叔叔 submitted on 2020-02-28 04:02:50
Question: The Memory Order Machine Clear performance event is described by the VTune documentation as: "The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired." However, I don't see why that should be the case. There is no synchronisation order between loads and stores on different logical processors. The processor could

What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core?

 ̄綄美尐妖づ submitted on 2020-02-24 11:13:30
Question: Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote). What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously),

Are load ops deallocated from the RS when they dispatch, complete or some other time?

主宰稳场 submitted on 2020-02-24 00:38:11
Question: On modern Intel[1] x86, are load uops freed from the RS (Reservation Station) at the point they dispatch[2], or when they complete[3], or somewhere in between[4]?
[1] I am also interested in AMD Zen and sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel, which may make investigating this on AMD a separate task.
[2] Dispatch here means leave the RS for execution.
