simd

Loading data for GCC's vector extensions

强颜欢笑 posted on 2019-12-30 04:36:07
Question: GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware-specific intrinsics (or auto-vectorization). A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector. typedef char v16qi __attribute__ ((vector_size(16))); static uint8_t checksum(uint8_t *buf, size_t size) { assert(size%16 == 0); uint8_t sum = 0; vec16qi
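One common, safe way to load bytes of arbitrary alignment into a GCC vector is to memcpy into the vector variable, which the compiler typically lowers to a single vector load. The sketch below assumes that approach; the checksum16 name is mine, and it uses uint8_t elements rather than the char typedef from the excerpt.

#include <cassert>
#include <cstdint>
#include <cstddef>
#include <cstring>

typedef std::uint8_t v16qu __attribute__((vector_size(16)));

static std::uint8_t checksum16(const std::uint8_t *buf, std::size_t size)
{
    assert(size % 16 == 0);
    v16qu acc = {0};
    for (std::size_t i = 0; i < size; i += 16) {
        v16qu chunk;
        std::memcpy(&chunk, buf + i, 16);  // safe for any alignment
        acc += chunk;                      // element-wise add, GCC vector extension
    }
    std::uint8_t sum = 0;
    for (int j = 0; j < 16; ++j)
        sum += acc[j];                     // horizontal reduction of the lanes
    return sum;
}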

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

早过忘川 posted on 2019-12-29 07:06:10
Question: Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring a 16-byte alignment for SSE or 32-byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code and the data
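One way to address the alignment half of that question, sketched below as an assumption rather than as an answer from the thread, is to promise the compiler an alignment it cannot deduce from a plain pointer: allocate the arrays aligned on the caller side and assert the alignment with __builtin_assume_aligned (GCC/Clang). The transform4x4 name and the 4x4 row-major layout are hypothetical.

#include <cstddef>

// Multiply nvecs 4-element vectors, stored contiguously in vecs, by a 4x4
// row-major matrix. The alignment hints let the vectorizer assume 32-byte
// aligned loads and stores suitable for AVX.
void transform4x4(const float *matrix, float *vecs, std::size_t nvecs)
{
    const float *m = static_cast<const float *>(__builtin_assume_aligned(matrix, 32));
    float *v = static_cast<float *>(__builtin_assume_aligned(vecs, 32));
    for (std::size_t i = 0; i < nvecs; ++i) {
        float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 4; ++c)
                out[r] += m[r * 4 + c] * v[i * 4 + c];
        for (int r = 0; r < 4; ++r)
            v[i * 4 + r] = out[r];
    }
}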

packing 10 bit values into a byte stream with SIMD

给你一囗甜甜゛ posted on 2019-12-29 04:55:07
Question: I'm trying to pack 10-bit pixels into a continuous byte stream, using SIMD instructions. The code below does it "in principle", but the SIMD version is slower than the scalar version. The problem seems to be that I can't find good gather/scatter operations that load the register efficiently. Any suggestions for improvement? // SIMD_test.cpp : Defines the entry point for the console application. // #include "stdafx.h" #include "Windows.h" #include <tmmintrin.h> #include <stdint.h> #include
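For reference, the scalar packing that such a SIMD version is measured against can be written as below. This is only a sketch under the assumption of a little-endian bit order within each 40-bit group (four 10-bit pixels to five output bytes); the pack10_scalar name and layout are hypothetical, not taken from the question's code.

#include <cstdint>
#include <cstddef>

// Pack four 10-bit pixels (stored one per uint16_t) into five bytes.
// npixels is assumed to be a multiple of 4.
static void pack10_scalar(const std::uint16_t *src, std::uint8_t *dst, std::size_t npixels)
{
    for (std::size_t i = 0; i < npixels; i += 4) {
        std::uint64_t v = (std::uint64_t)(src[i]     & 0x3FF)
                        | (std::uint64_t)(src[i + 1] & 0x3FF) << 10
                        | (std::uint64_t)(src[i + 2] & 0x3FF) << 20
                        | (std::uint64_t)(src[i + 3] & 0x3FF) << 30;
        for (int b = 0; b < 5; ++b)        // emit the 40-bit group low byte first
            *dst++ = (std::uint8_t)(v >> (8 * b));
    }
}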

Is an __m128i variable zero?

给你一囗甜甜゛ posted on 2019-12-28 21:50:53
Question: How do I test if a __m128i variable has any nonzero value on SSE2-and-earlier processors? Answer 1: In SSE2 you can do: __m128i zero = _mm_setzero_si128(); if(_mm_movemask_epi8(_mm_cmpeq_epi32(x,zero)) == 0xFFFF) { //the code... } This tests four ints against zero and returns a mask with one bit per byte, so the bit offsets of each corresponding int are at 0, 4, 8 and 12. The test above catches whether any bit is set; if you preserve the mask, you can work with the finer-grained parts directly
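A self-contained version of that SSE2 test, with a small driver added here purely for illustration (the is_zero_sse2 name is mine, not from the answer):

#include <emmintrin.h>
#include <cstdio>

static bool is_zero_sse2(__m128i x)
{
    __m128i zero = _mm_setzero_si128();
    // All 16 bytes compare equal to zero iff the movemask is 0xFFFF.
    return _mm_movemask_epi8(_mm_cmpeq_epi32(x, zero)) == 0xFFFF;
}

int main()
{
    __m128i a = _mm_setzero_si128();
    __m128i b = _mm_set_epi32(0, 0, 1, 0);
    std::printf("a is zero: %d, b is zero: %d\n", is_zero_sse2(a), is_zero_sse2(b));
    return 0;
}

On SSE4.1 and later processors, _mm_testz_si128(x, x) performs the same all-zero check in a single instruction.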

Implementation of __builtin_clz

陌路散爱 posted on 2019-12-28 12:00:12
Question: What is the implementation of GCC's (4.6+) __builtin_clz? Does it correspond to some CPU instruction on Intel x86_64 (AVX)? Answer 1: It should translate to a Bit Scan Reverse instruction and a subtract. BSR gives the index of the leading 1, and you can then subtract that from the word size minus one (31 - BSR(x) for a 32-bit value) to get the number of leading zeros. Edit: if your CPU supports LZCNT (Leading Zero Count), then that will probably do the trick too, but not all x86-64 chips have that instruction. Answer 2: Yes, and no. CLZ
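A small sketch (my own illustration, not code from the thread) of that relationship: for nonzero 32-bit x, __builtin_clz(x) equals 31 minus the index of the highest set bit, which is what BSR computes.

#include <cstdint>
#include <cstdio>

// Portable stand-in for BSR: index of the highest set bit. x must be nonzero,
// matching the precondition of __builtin_clz.
static unsigned highest_set_bit(std::uint32_t x)
{
    unsigned idx = 0;
    while (x >>= 1)
        ++idx;
    return idx;
}

int main()
{
    std::uint32_t x = 0x00F00000u;
    std::printf("__builtin_clz: %d, 31 - BSR: %u\n",
                __builtin_clz(x), 31 - highest_set_bit(x));  // both print 8
    return 0;
}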

SSE-copy, AVX-copy and std::copy performance

回眸只為那壹抹淺笑 posted on 2019-12-28 10:07:05
Question: I'm trying to improve the performance of a copy operation via SSE and AVX: #include <immintrin.h> const int sz = 1024; float *mas = (float *)_mm_malloc(sz*sizeof(float), 16); float *tar = (float *)_mm_malloc(sz*sizeof(float), 16); float a=0; std::generate(mas, mas+sz, [&](){return ++a;}); const int nn = 1000;//Number of iterations in tester loops std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; //std::copy testing start1 = std::chrono::system_clock::now();
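The SSE copy loop being benchmarked against std::copy can be sketched as follows. This assumes the same 16-byte-aligned _mm_malloc buffers and a size that is a multiple of 4 floats; copy_sse is a name introduced here, not from the excerpt.

#include <immintrin.h>
#include <algorithm>

static void copy_sse(const float *src, float *dst, int n)
{
    // Aligned 128-bit loads and stores, 4 floats per iteration.
    for (int i = 0; i < n; i += 4)
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
}

// Usage, with the buffers from the excerpt:
//   copy_sse(mas, tar, sz);            // SSE version
//   std::copy(mas, mas + sz, tar);     // baseline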

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

妖精的绣舞 posted on 2019-12-28 05:59:08
Question: AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies[1], but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger than 32 bits, but less than or equal to 52 bits - can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2
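The core of the idea can be checked in scalar code. The sketch below (my own illustration, not the accepted answer) relies only on the fact that an IEEE-754 double carries a 53-bit significand, so when both inputs and the product fit in 52 bits, the conversions and the multiply are all exact.

#include <cstdint>
#include <cstdio>

// Caller must guarantee that a, b and a*b all fit in 52 bits.
static std::uint64_t mul52_via_double(std::uint64_t a, std::uint64_t b)
{
    return (std::uint64_t)((double)a * (double)b);
}

int main()
{
    std::uint64_t a = (1ULL << 26) + 12345;   // about 2^26
    std::uint64_t b = (1ULL << 25) + 678;     // about 2^25, so a*b < 2^52
    std::printf("double path: %llu, integer path: %llu\n",
                (unsigned long long)mul52_via_double(a, b),
                (unsigned long long)(a * b));
    return 0;
}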

SSE intrinsic functions reference [closed]

五迷三道 posted on 2019-12-28 02:24:26
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 7 years ago. Does anyone know of a reference listing the operation of the SSE intrinsic functions for gcc, i.e. the functions in the <*mmintrin.h> header files? Thanks. Answer 1: As well as all the online PDF documentation already mentioned, there is also a very useful utility which summarizes all the instructions and intrinsics