intel

Does the store buffer hold physical or virtual addresses on modern x86?

别说谁变了你拦得住时间么 submitted on 2020-04-14 07:35:54

Question: Modern Intel and AMD chips have large store buffers that buffer stores before they commit to the L1 cache. Conceptually, these entries hold the store data and the store address. For the address part, do these buffer entries hold virtual or physical addresses, or both? Source: https://stackoverflow.com/questions/61190976/does-the-store-buffer-hold-physical-or-virtual-addresses-on-modern-x86

Counting number of allocations into the Write Pending Queue - unexpected low result on NV memory

孤人 submitted on 2020-04-13 08:06:08
Question: I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS, which is supposed to count the number of allocations into the Write Pending Queue. The machine has 2 Intel Xeon Gold 5218 CPUs (Cascade Lake architecture) with 2 memory controllers per CPU. The Linux version is 5.4.0-3-amd64. I have the following simple loop and I am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line.

    for(int i=0; i < 1000000; i++){ array[i] …
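
Since the loop body is cut off in the excerpt, the following is only a minimal sketch of the kind of measurement loop it appears to describe: one store per 64-byte element, followed by a flush so each line is eventually written back to memory. The write-and-flush body and the element layout are assumptions, not the asker's actual code.

    #include <stdlib.h>
    #include <immintrin.h>   /* _mm_clflush, _mm_mfence */

    /* One element per 64-byte cache line, as stated in the excerpt. */
    typedef struct { long value; char pad[56]; } line_t;

    int main(void) {
        const size_t n = 1000000;
        line_t *array = aligned_alloc(64, n * sizeof(line_t));
        if (!array) return 1;

        /* Assumed loop body: store to each line and push it toward memory,
           so each iteration should eventually allocate a WPQ entry. */
        for (size_t i = 0; i < n; i++) {
            array[i].value = 1;
            _mm_clflush(&array[i]);   /* assumption: each line is flushed after the store */
        }
        _mm_mfence();

        free(array);
        return 0;
    }

The counter itself would be read from outside the program, for example with perf or libpfm4 using the skx_unc_imc0-5::UNC_M_WPQ_INSERTS event name given in the excerpt.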

Why does the latency of the sqrtsd instruction change based on the input? Intel processors

三世轮回 submitted on 2020-04-12 16:10:34

Question: The Intel intrinsics guide states that the "sqrtsd" instruction has a latency of 18 cycles. I tested it with my own program and that is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x number) the latency is only 13. Why is that? One theory I had is that since 13 is the latency of "sqrtss", which is the same as "sqrtsd" but done on 32-bit floating point, maybe the processor was smart enough to understand that 256 can fit in 32 bits and …
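
A minimal sketch of a dependency-chain latency test for sqrtsd, assuming the asker measures back-to-back dependent square roots; the iteration count, the use of __rdtsc for timing, and taking the test value from the command line are illustrative choices, not the asker's actual harness.

    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_sqrt_sd (GCC/Clang) */

    int main(int argc, char **argv) {
        double v = argc > 1 ? atof(argv[1]) : 256.0;   /* value under test: try 0.15 vs 256 */
        const long iters = 100000000;
        __m128d input = _mm_set_sd(v);
        __m128d x = input;

        unsigned long long t0 = __rdtsc();
        for (long i = 0; i < iters; i++)
            /* sqrt of the constant input; the loop-carried dependency runs through
               x's preserved upper half, so back-to-back sqrtsd latency is measured */
            x = _mm_sqrt_sd(x, input);
        unsigned long long t1 = __rdtsc();

        /* rdtsc counts reference cycles; pin the core frequency or scale accordingly */
        printf("~%.1f ref cycles per sqrtsd (result %f)\n",
               (double)(t1 - t0) / iters, _mm_cvtsd_f64(x));
        return 0;
    }

With an input like 256.0 (an exact power of two) the measured figure should come out lower than with 0.15, which is the observation the question is asking about.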

Invalid framebuffer operation after glCheckFramebufferStatus

断了今生、忘了曾经 submitted on 2020-04-11 18:05:32

Question: I am getting a weird OpenGL error when running my application on my HD4000 (Windows 64-bit, driver version 15.28.20.64.3347). I boiled it down to a few OpenGL calls to reproduce it:
1. Create two framebuffer objects.
2. Create a texture and attach it as GL_COLOR_ATTACHMENT0 to both FBOs.
3. Call glTexImage2D a second time on the texture.
4. Bind the first FBO and call glCheckFramebufferStatus (returns GL_FRAMEBUFFER_COMPLETE).
5. Bind the second FBO and call glClear.
The glClear gives a GL_INVALID_FRAMEBUFFER …
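
A minimal sketch of the repro sequence listed above, assuming a desktop GL context is already current and an extension loader (GLEW here, as an assumption) has been initialized; the texture size and format are illustrative and error handling is omitted.

    #include <GL/glew.h>

    static void reproduce(void) {
        GLuint fbo[2], tex;
        glGenFramebuffers(2, fbo);
        glGenTextures(1, &tex);

        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

        /* Steps 1-2: attach the same texture to both FBOs as color attachment 0. */
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[0]);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[1]);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);

        /* Step 3: re-specify the texture storage a second time. */
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);

        /* Step 4: the first FBO reports complete. */
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[0]);
        GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);   /* GL_FRAMEBUFFER_COMPLETE */
        (void)status;

        /* Step 5: clearing through the second FBO triggers the "invalid framebuffer
           operation" error from the title on the driver in question. */
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[1]);
        glClear(GL_COLOR_BUFFER_BIT);
    }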

AVX 256-bit vectors slightly slower than scalar (~10%) for STREAM-like double add loop on huge arrays, on Xeon Gold

回眸只為那壹抹淺笑 submitted on 2020-04-11 04:56:06

Question: I am new to the AVX-512 instruction set and I wrote the following code as a demo.

    #include <iostream>
    #include <array>
    #include <chrono>
    #include <vector>
    #include <cstring>
    #include <omp.h>
    #include <immintrin.h>
    #include <cstdlib>

    int main() {
        unsigned long m, n, k;
        m = n = k = 1 << 30;
        auto *a = static_cast<double*>(aligned_alloc(512, m*sizeof(double)));
        auto *b = static_cast<double*>(aligned_alloc(512, n*sizeof(double)));
        auto *c = static_cast<double*>(aligned_alloc(512, k*sizeof(double)));
    …
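
The demo above is cut off, so the following is only a minimal sketch of the STREAM-like 256-bit add kernel the title describes, assuming the benchmark ends up computing c[i] = a[i] + b[i]; the array size is scaled down from the 1 << 30 in the excerpt, and the timing and OpenMP parts are omitted.

    #include <stdlib.h>
    #include <immintrin.h>

    /* STREAM-like add: c[i] = a[i] + b[i], 4 doubles (256 bits) per iteration. */
    static void add_avx(const double *a, const double *b, double *c, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256d va = _mm256_load_pd(a + i);   /* requires 32-byte aligned arrays */
            __m256d vb = _mm256_load_pd(b + i);
            _mm256_store_pd(c + i, _mm256_add_pd(va, vb));
        }
        for (; i < n; i++)                        /* scalar tail */
            c[i] = a[i] + b[i];
    }

    int main(void) {
        size_t n = (size_t)1 << 26;               /* much smaller than the 1 << 30 in the excerpt */
        double *a = aligned_alloc(64, n * sizeof(double));
        double *b = aligned_alloc(64, n * sizeof(double));
        double *c = aligned_alloc(64, n * sizeof(double));
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }
        add_avx(a, b, c, n);
        double sample = c[0];                     /* keep the result live */
        free(a); free(b); free(c);
        return (int)sample;
    }

Compiled with something like gcc -O2 -mavx, a kernel like this is DRAM-bandwidth-bound on arrays of this size, which is consistent with the 256-bit version gaining little or even losing slightly versus scalar code.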

Why both? vperm2f128 (avx) vs vperm2i128 (avx2)

北城以北 submitted on 2020-04-09 17:57:16

Question: AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256). They both seem to do exactly the same thing, and their respective latencies and throughputs also seem to be identical. So why do both instructions exist? There has to be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data structures introduced with AVX, I cannot imagine that a …
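
A minimal sketch contrasting the two intrinsics named in the excerpt: with the same control byte they select the same 128-bit lanes from a pair of __m256i values. The input values and the 0x21 control (low half from a's high lane, high half from b's low lane) are just illustrative.

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        __m256i a = _mm256_setr_epi64x(0, 1, 2, 3);
        __m256i b = _mm256_setr_epi64x(4, 5, 6, 7);

        /* imm8 = 0x21: low 128 bits <- high lane of a, high 128 bits <- low lane of b. */
        __m256i vf = _mm256_permute2f128_si256(a, b, 0x21);   /* AVX:  vperm2f128 */
        __m256i vi = _mm256_permute2x128_si256(a, b, 0x21);   /* AVX2: vperm2i128 */

        long long rf[4], ri[4];
        _mm256_storeu_si256((__m256i *)rf, vf);
        _mm256_storeu_si256((__m256i *)ri, vi);
        printf("%lld %lld %lld %lld\n", rf[0], rf[1], rf[2], rf[3]);  /* 2 3 4 5 */
        printf("%lld %lld %lld %lld\n", ri[0], ri[1], ri[2], ri[3]);  /* 2 3 4 5 */
        return 0;
    }

Compile with -mavx2; only the second intrinsic actually requires AVX2, which is part of what the question is getting at.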

Why does using MFENCE with a store instruction block prefetching in the L1 cache?

扶醉桌前 submitted on 2020-03-18 04:46:11

Question: I have an object 64 bytes in size:

    typedef struct _object{
        int value;
        char pad[60];
    } object;

In main I am initializing an array of object:

    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));
    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush(&array[i]);
    }
    _mm_mfence();

Then I loop again through each element. This is the loop I am counting events for:

    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i …
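
The excerpt cuts off inside the measured loop; judging from the title, the variant under test presumably also issues an MFENCE after each store. A minimal sketch of that assumed variant, reusing array and arr_size from the snippet above:

    /* Assumed shape of the measured loop (not shown in full in the excerpt):
       store to each previously flushed line, then drain the store buffer. */
    for (int i = 0; i < arr_size - 105; i++) {
        array[i].value = 2;
        _mm_mfence();   /* the question asks why this per-store fence also
                           appears to block prefetching into L1 */
    }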

Intel's CLWB instruction invalidating cache lines

允我心安 submitted on 2020-03-09 05:34:40

Question: I am trying to find a configuration or memory access pattern for Intel's clwb instruction that would not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and directly mapping this char device into the address space. I also tried adding this non-volatile memory as a new NUMA node and using the numactl --membind command to bind memory to it. In both cases, when I use clwb on a cached address, it is evicted. I …
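
A minimal sketch of the Device-DAX experiment described above, assuming a /dev/dax0.0 device exists (the path is hypothetical), the CPU supports CLWB, and the code is built with -mclwb; the mapping size and the read-back used to judge whether the line stayed cached are illustrative.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <immintrin.h>   /* _mm_clwb, _mm_sfence */

    int main(void) {
        size_t len = 2UL << 20;                     /* one 2 MiB chunk, matching a common DAX alignment */
        int fd = open("/dev/dax0.0", O_RDWR);       /* hypothetical device path */
        if (fd < 0) { perror("open"); return 1; }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 42;          /* bring the line into the cache and dirty it */
        _mm_clwb(p);        /* write the line back; ideally it would stay cached */
        _mm_sfence();       /* order the write-back */

        /* A later timed reload of p[0] (e.g. with rdtsc) would show whether the
           line was merely written back or, as the asker observes, evicted. */
        printf("%d\n", p[0]);

        munmap(p, len);
        close(fd);
        return 0;
    }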

Unrolling 1-cycle loop reduces performance by 25% on Skylake. uops scheduling issue?

陌路散爱 submitted on 2020-02-28 03:08:41

Question: TL;DR I have a loop that takes 1 cycle per iteration on Skylake (it does 3 additions + 1 inc/jump). When I unroll it more than 2 times (no matter how much), my program runs about 25% slower. It might have something to do with alignment, but I don't clearly see what. EDIT: this question used to ask why uops were delivered by the DSB rather than the MITE; that has now been moved to this question. I was trying to benchmark a loop which does 3 additions on my Skylake machine. This loop should execute in …
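
A minimal sketch of the kind of 1-cycle-per-iteration loop the TL;DR describes, written as GCC inline assembly; the registers, the iteration count, and the use of a macro-fusable dec/jnz in place of "inc/jump" are assumptions about the asker's actual benchmark.

    /* Three independent adds plus the loop-counter update: four fused-domain uops,
       so the non-unrolled loop can sustain one iteration per cycle on Skylake. */
    static void add_loop(long iters) {
        long a = 0, b = 0, c = 0;
        __asm__ volatile(
            "1:\n\t"
            "add $1, %[a]\n\t"
            "add $1, %[b]\n\t"
            "add $1, %[c]\n\t"
            "dec %[n]\n\t"          /* assumption: dec/jnz as the loop's inc/jump */
            "jnz 1b\n\t"
            : [a] "+r"(a), [b] "+r"(b), [c] "+r"(c), [n] "+r"(iters)
            :
            : "cc");
    }

    int main(void) {
        add_loop(1000000000L);      /* ~0.3 s at 3-4 GHz if it really runs at 1 cycle/iter */
        return 0;
    }

Timing this with perf stat -e cycles,instructions against a manually unrolled variant is one way to look for the ~25% difference the question reports; loop alignment can be varied with .p2align directives inside the asm block.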