avx | 易学教程

[Repost] AVX / AVX2 Instruction Programming

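AVX / AVX2 Instruction Programming https://zhuanlan.zhihu.com/p/94649418 I found this very well explained, though I still don't understand the difference between AVX2 and AVX512. I have recently been working on accelerating encryption algorithms; because they involve a large amount of C-based matrix arithmetic, the optimization requires AVX instructions. This article is not a systematic introduction, only ordinary beginner's notes; it mainly introduces the functions (a Chinese translation of the documentation). Please credit the source when reposting, otherwise I will draw circles and curse you to never write code again, only bugs >_<. Prerequisites: basic C; an understanding of what SIMD is. Official material (if you can read it, there is no need to read any of my rambling): all the intrinsics for Intel's SSE, AVX, AVX2, AVX512 and so on can be found here: PDF version: 19.0U1_CPP_Compiler_DGR_0.pdf software.intel.com Online version (filterable): Intrinsics Guide software.intel.com A taste of SIMD: programming with SIMD. Source: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions Without SIMD: vectorAdd(const float* a, const float* b, const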
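
The excerpt breaks off just as it starts comparing scalar and SIMD vector addition. As a rough illustration of that comparison (not the original article's code), here is a minimal sketch. The names `vector_add_scalar`, `vector_add_avx` and `vector_add` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86.

```cpp
#include <immintrin.h>  // AVX intrinsics
#include <cstddef>

// Scalar baseline: one addition per loop iteration.
void vector_add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// AVX version: eight float additions per iteration.
// The target attribute (a GCC/Clang extension, assumed here) enables AVX
// for this function only, so no -mavx flag is needed on the command line.
__attribute__((target("avx")))
void vector_add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned load of 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail for the leftovers
}

// Take the AVX path only when the CPU and OS report support for it.
void vector_add(const float* a, const float* b, float* out, std::size_t n) {
    if (__builtin_cpu_supports("avx")) vector_add_avx(a, b, out, n);
    else vector_add_scalar(a, b, out, n);
}
```

Calling `vector_add(a, b, out, n)` works for any `n`; lengths that are not a multiple of 8 are finished by the scalar tail loop.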

The AVX Instruction Set and TF Environment Requirements

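The AVX (Advanced Vector Extensions) instruction set borrows some design ideas from AMD's SSE5 and extends and strengthens them, forming a complete, new-generation SIMD instruction set specification. Intel's AVX instruction set has been extended and strengthened in a number of areas. About instruction sets and the AVX instruction set: an instruction set is the collection of all instructions a CPU can execute. Each instruction corresponds to one operation, and every program must ultimately be compiled into individual instructions before the CPU can recognize and execute it. The CPU relies on instructions to compute and to control the system, so the strength of its instructions is an important measure of CPU performance, and the instruction set has become an effective tool for improving CPU efficiency. Every CPU has a basic instruction set; for example, the vast majority of current Intel and AMD processors use the x86 instruction set, because they all derive from the x86 architecture. But however fast the CPU is, a basic x86 instruction can only process one piece of data at a time, which is inefficient, since in many applications data comes in groups, such as the coordinates (XYZ) and color (RGB) of a point, or multi-channel audio. To improve CPU performance in particular areas, special instructions had to be added to keep up with demand, and these added instructions make up the extended instruction sets.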
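
Because extended instruction sets differ from one CPU to another, software that depends on AVX normally needs to know whether the machine provides it. The following is a minimal sketch of such a check. The function name `supported_simd_extensions` is hypothetical, and `__builtin_cpu_supports` is a GCC/Clang builtin for x86 (MSVC would need `__cpuid` instead).

```cpp
#include <string>

// Returns a space-separated list of the SIMD extensions the running CPU reports.
// Assumption: GCC or Clang on x86, where __builtin_cpu_supports is available.
std::string supported_simd_extensions() {
    std::string out;
    if (__builtin_cpu_supports("sse2"))    out += "sse2 ";
    if (__builtin_cpu_supports("avx"))     out += "avx ";
    if (__builtin_cpu_supports("avx2"))    out += "avx2 ";
    if (__builtin_cpu_supports("fma"))     out += "fma ";
    if (__builtin_cpu_supports("avx512f")) out += "avx512f ";
    return out;
}
```

Printing the returned string at program start is a quick way to confirm what a given machine offers before loading a library built with AVX enabled.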

How to use the Intel AVX in Java?

Question: How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find. Answer 1: As far as I know, most current JVM JIT compilers don't support automatic vectorization, or do so only for very simple loops, so you're out of luck. Mono's .NET implementation has Mono.Simd for manual vector code emission, and Microsoft later introduced System.Numerics.Vectors. Unfortunately there's nothing similar in Java. I don't know if Java's vector class is

How to solve “illegal instruction” for vfmadd213ps?

Question: I tried using AVX intrinsics, but they caused "Unhandled exception at 0x00E01555 in test.exe: 0xC000001D: Illegal Instruction." I am using Visual Studio 2015, and the exception is raised at the "vfmadd213ps ymm2,ymm1,ymm0" instruction. I have tried setting "/arch:AVX" and "/arch:AVX2", but the error still occurs. Below is my code. #include <immintrin.h> int main(int argc, char *argv[]) { float a[8] = { 0 }; float b[8] = { 0 }; float c[8] = { 0 }; __m256 _a = _mm256_loadu_ps(a); __m256 _b = _mm256_loadu_ps
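
The answer is not part of this excerpt. `vfmadd213ps` belongs to the FMA extension, which has its own CPUID feature bit separate from AVX, so one plausible cause is a CPU (or virtual machine) that supports AVX but not FMA. A minimal sketch of guarding the fused instruction at run time follows. The function names are hypothetical, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang (on MSVC the same check would go through `__cpuid`).

```cpp
#include <immintrin.h>

// a*b + c on 8 floats with a fused multiply-add (requires the FMA feature).
__attribute__((target("avx,fma")))
static void madd_fma(const float* a, const float* b, const float* c, float* out) {
    __m256 r = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b), _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
}

// Same computation with a separate multiply and add: plain AVX, no FMA needed.
__attribute__((target("avx")))
static void madd_avx(const float* a, const float* b, const float* c, float* out) {
    __m256 prod = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    _mm256_storeu_ps(out, _mm256_add_ps(prod, _mm256_loadu_ps(c)));
}

// Scalar fallback for CPUs without AVX.
static void madd_scalar(const float* a, const float* b, const float* c, float* out) {
    for (int i = 0; i < 8; ++i) out[i] = a[i] * b[i] + c[i];
}

// Dispatch on what the CPU reports, so vfmadd* is never executed
// on a machine that would raise an illegal-instruction fault.
void madd8(const float* a, const float* b, const float* c, float* out) {
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma")) madd_fma(a, b, c, out);
    else if (__builtin_cpu_supports("avx")) madd_avx(a, b, c, out);
    else madd_scalar(a, b, c, out);
}
```

The fused and unfused paths can differ in the last bit for general inputs, because FMA rounds once instead of twice.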

Mathematical functions for SIMD registers

Question: According to https://sourceware.org/glibc/wiki/libmvec, glibc provides vector implementations of math functions. The compiler can use them for optimizations, as can be seen in this example: https://godbolt.org/g/IcxtVi, where the compiler calls a mangled sine function that operates on 4 doubles at a time. I know that there are SIMD math libraries that can be used if I need math functions, but I am still interested: is there a way to manually call the vectorized math functions that already exist in glibc on __m256d

Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

Question: In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this: /* a is a 64-byte aligned array of double */ __m256d b0 = _mm256_broadcast_sd(&b[4*k+0]); __m256d b1 = _mm256_broadcast_sd(&b[4*k+1]); __m256d b2 = _mm256_broadcast_sd(&b[4*k+2]); __m256d b3 = _mm256_broadcast_sd(&b[4*k+3]); I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU

Segmentation fault (core dumped) when using avx on an array allocated with new[]

Question: When I run this code in Visual Studio 2015, it works correctly. But in Code::Blocks it fails with the following error: Segmentation fault (core dumped). I also ran the code on Ubuntu, with the same error. #include <iostream> #include <immintrin.h> struct INFO { unsigned int id = 0; __m256i temp[8]; }; int main() { std::cout<<"Start AVX..."<<std::endl; int _size = 100; INFO *info = new INFO[_size]; for (int i = 0; i<_size; i++) { for (int k = 0; k < 8; k++) { info[i].temp[k] = _mm256_setr

What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?

Question: Say I want to clear 4 zmm registers. Will the following code provide the fastest speed? vpxorq zmm0, zmm0, zmm0 vpxorq zmm1, zmm1, zmm1 vpxorq zmm2, zmm2, zmm2 vpxorq zmm3, zmm3, zmm3 On AVX2, if I wanted to clear ymm registers, vpxor was fastest, faster than vxorps, since vpxor could run on multiple units. On AVX512, we don't have vpxor for zmm registers, only vpxorq and vpxord. Is that an efficient way to clear a register? Is the CPU smart enough to not make false dependencies on previous

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

Question: In SSSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. I'm currently in the midst of porting my SSE4 code to use AVX2 instructions and working on 256-bit registers instead of 128-bit. Naively, I believed that the
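
The excerpt stops before any answer. The AVX2 `_mm256_alignr_epi8` operates on each 128-bit lane separately, so it is not a full 256-bit PALIGNR. One commonly used workaround first builds the 128-bit window that straddles the two inputs with `_mm256_permute2x128_si256`. The sketch below does this for a fixed shift of 4 bytes. The function names are hypothetical, the approach is valid for shifts of 0 to 16 bytes, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstdint>

// Treat hi:lo as one 64-byte value and return bytes 4..35 of it, i.e. a
// right shift by 4 bytes that crosses the 128-bit lane boundary.
__attribute__((target("avx2")))
static __m256i alignr256_by4(__m256i hi, __m256i lo) {
    // t = [ low half of hi : high half of lo ], the window between the inputs.
    __m256i t = _mm256_permute2x128_si256(lo, hi, 0x21);
    // The per-lane alignr now finds the correct upper neighbour in each lane.
    return _mm256_alignr_epi8(t, lo, 4);
}

// Array wrapper so callers need no vector types.
__attribute__((target("avx2")))
static void shift_in_one_avx2(const std::int32_t* lo, const std::int32_t* hi, std::int32_t* out) {
    __m256i vlo = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lo));
    __m256i vhi = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(hi));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), alignr256_by4(vhi, vlo));
}

// out[0..6] = lo[1..7], out[7] = hi[0]; falls back to scalar code without AVX2.
void shift_in_one(const std::int32_t* lo, const std::int32_t* hi, std::int32_t* out) {
    if (__builtin_cpu_supports("avx2")) { shift_in_one_avx2(lo, hi, out); return; }
    for (int i = 0; i < 7; ++i) out[i] = lo[i + 1];
    out[7] = hi[0];
}
```

For shifts larger than 16 bytes the same idea applies with the operands moved up one step: `_mm256_alignr_epi8(hi, t, n - 16)`.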

Intel SIMD - How can I check if an __m256* contains any non-zero values

Question: I am using the Microsoft Visual Studio compiler. I am trying to find out if a 256-bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work. Answer 1: _mm256_testz_ps just tests the sign bits. In order to test the values, you'll need to compare against 0 and then extract the resulting mask, e.g. __m256 vcmp = _mm256_cmp_ps(*pSrc1, _mm256_set1_ps(0.0f), _CMP_EQ_OQ); int mask = _mm256_movemask_ps(vcmp); bool any_nz = mask != 0xff; Source:
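
The compare-and-movemask approach from the answer can be wrapped into a small, self-contained function, sketched below. The names `any_nonzero_avx` and `any_nonzero` are hypothetical, the input is taken as a plain array of 8 floats rather than an `__m256*`, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang rather than the asker's Visual Studio.

```cpp
#include <immintrin.h>

// True if any of the 8 floats at p compares unequal to zero.
__attribute__((target("avx")))
static bool any_nonzero_avx(const float* p) {
    __m256 v = _mm256_loadu_ps(p);
    // Lanes equal to zero become all-ones; NaN lanes compare false.
    __m256 is_zero = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_EQ_OQ);
    // movemask collects the top bit of each lane: 0xff means every lane was zero.
    return _mm256_movemask_ps(is_zero) != 0xff;
}

// Dispatching wrapper with a scalar fallback for CPUs without AVX.
bool any_nonzero(const float* p) {
    if (__builtin_cpu_supports("avx")) return any_nonzero_avx(p);
    for (int i = 0; i < 8; ++i)
        if (p[i] != 0.0f) return true;
    return false;
}
```

Both paths treat -0.0f as zero and NaN as non-zero, because the comparison is on floating-point value rather than on raw bits.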