intel

MITE (legacy pipeline) used instead of DSB (uops cache) when jump is not quite aligned on 32 bytes

强颜欢笑 · Submitted on 2020-02-24 03:59:12

Question: This question used to be part of this (now updated) question, but it seems like it should be a separate question, since it didn't help to get an answer to the other one. My starting point is a loop doing 3 independent additions:

    for (unsigned long i = 0; i < 2000000000; i++) {
        asm volatile("" : "+r" (a), "+r" (b), "+r" (c), "+r" (d)); // prevents the C compiler from optimizing out the adds
        a = a + d;
        b = b + d;
        c = c + d;
    }

When this loop is not unrolled, it executes in 1 cycle (which is to be …

Are load ops deallocated from the RS when they dispatch, complete or some other time?

主宰稳场 · Submitted on 2020-02-24 00:38:11

Question: On modern Intel¹ x86, are load uops freed from the RS (Reservation Station) at the point they dispatch², when they complete³, or somewhere in between⁴?

¹ I am also interested in AMD Zen and its sequels, so feel free to include that too, but for the purposes of making the question manageable I limit it to Intel. Also, AMD seems to have a somewhat different load pipeline from Intel, which may make investigating this on AMD a separate task.
² Dispatch here means leave the RS for execution. …

Dynamically determining where a rogue AVX-512 instruction is executing

有些话、适合烂在心里 · Submitted on 2020-02-20 06:35:13

Question: I have a process running on an Intel machine that supports AVX-512, but this process doesn't directly use any AVX-512 instructions (asm or intrinsics) and is compiled with -mno-avx512f so that the compiler doesn't insert any AVX-512 instructions. Yet it is running indefinitely at the reduced AVX turbo frequency. No doubt there is an AVX-512 instruction sneaking in somewhere, via a library, a (very unlikely) system call, or something like that. Rather than try to "binary search" down where the …

AMD and Intel programmer's model compatibility

泄露秘密 · Submitted on 2020-02-04 05:08:28

Question: I have read through Intel's Software Developer's Manuals (vol 1–3). Without doing a similar read-through of AMD's Programming Guides (vol 1–5), I am wondering which aspects of Intel's and AMD's programming models are the same. Of course, even within a family of processors, there will be model-specific registers and support for various extensions and functionality. However, Intel does make some general statements about simple things that, in general, I am unsure carry over to AMD. For …

Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

拟墨画扇 · Submitted on 2020-01-28 08:03:21

Question: In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

    // b is a 64-byte aligned array of double
    __m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
    __m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
    __m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
    __m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU …

OpenMP 4.0 for accelerators: Nvidia GPU target

◇◆丶佛笑我妖孽 · Submitted on 2020-01-25 18:08:26

Question: I'm trying to use OpenMP for accelerators (OpenMP 4.0) in Visual Studio 2012, using the Intel C++ 15.0 compiler. My accelerator is an Nvidia GeForce GTX 670. This code does not compile:

    #include <stdio.h>
    #include <iostream>
    #include <omp.h>
    using namespace std;

    int main() {
        #pragma omp target
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            cout << "Hello world, i am number " << i << endl;
    }

Of course, everything goes fine when I comment out the #pragma omp target line. I get the same problem when …
