intel

MITE (legacy pipeline) used instead of DSB (uops cache) when jump is not quite aligned on 32 bytes

强颜欢笑 · Submitted on 2020-02-24 03:59:12

Question: This question used to be part of this (now updated) question, but it seems like it should be a separate question, since it didn't help to get an answer to the other one. My starting point is a loop doing 3 independent additions:

    for (unsigned long i = 0; i < 2000000000; i++) {
        asm volatile("" : "+r" (a), "+r" (b), "+r" (c), "+r" (d)); // prevents the C compiler from optimizing out the adds
        a = a + d;
        b = b + d;
        c = c + d;
    }

When this loop is not unrolled, it executes in 1 cycle (which is to be …

Are load ops deallocated from the RS when they dispatch, complete or some other time?

主宰稳场 · Submitted on 2020-02-24 00:38:11

Question: On modern Intel¹ x86, are load uops freed from the RS (Reservation Station) at the point they dispatch², when they complete³, or somewhere in between⁴?

¹ I am also interested in AMD Zen and its sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel, which may make investigating this on AMD a separate task.
² Dispatch here means leave the RS for execution. …

Dynamically determining where a rogue AVX-512 instruction is executing

有些话、适合烂在心里 · Submitted on 2020-02-20 06:35:13

Question: I have a process running on an Intel machine that supports AVX-512, but this process doesn't directly use any AVX-512 instructions (asm or intrinsics) and is compiled with -mno-avx512f so that the compiler doesn't insert any AVX-512 instructions. Yet it is running indefinitely at the reduced AVX turbo frequency. No doubt there is an AVX-512 instruction sneaking in somewhere, via a library, a (very unlikely) system call, or something like that. Rather than try to "binary search" down where the …

AMD and Intel programmer's model compatibility

泄露秘密 · Submitted on 2020-02-04 05:08:28

Question: I have read through Intel's Software Developer's Manuals (vol 1–3). Without doing a similar read-through of AMD's Programming Guides (vol 1–5), I am wondering which aspects of Intel's and AMD's programming models are the same. Of course, even within a family of processors, there will be model-specific registers and support for various extensions and functionality. However, Intel does make some general statements about simple things that, in general, I am unsure carry over to AMD. For …

Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

拟墨画扇 · Submitted on 2020-01-28 08:03:21

Question: In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

    // b is a 64-byte aligned array of double
    __m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
    __m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
    __m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
    __m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU …

OpenMP 4.0 for accelerators: Nvidia GPU target

◇◆丶佛笑我妖孽 · Submitted on 2020-01-25 18:08:26

Question: I'm trying to use OpenMP for accelerators (OpenMP 4.0) in Visual Studio 2012, using the Intel C++ 15.0 compiler. My accelerator is an Nvidia GeForce GTX 670. This code does not compile:

    #include <stdio.h>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            cout << "Hello world, i am number " << i << endl;
    }

Of course, everything goes fine when I comment out the #pragma omp target line. I get the same problem when …
