avx

SSE - AVX conversion from double to char

梦想的初衷 submitted on 2019-12-06 12:41:57
Question: I want to convert a vector of double-precision values to char. I have to provide two distinct implementations, one for SSE2 and the other for AVX2. I started with AVX2.

__m128i sub_proc(__m256d& in) {
    __m256d _zero_pd = _mm256_setzero_pd();
    __m256d ih_pd = _mm256_unpackhi_pd(in, _zero_pd);
    __m256d il_pd = _mm256_unpacklo_pd(in, _zero_pd);
    __m128i ih_si = _mm256_cvtpd_epi32(ih_pd);
    __m128i il_si = _mm256_cvtpd_epi32(il_pd);
    ih_si = _mm_shuffle_epi32(ih_si, _MM_SHUFFLE(3,1,2,0));
    il_si = _mm_shuffle_epi32
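For the narrowing itself, a minimal sketch of one common pattern (an assumption on my part, not the asker's finished routine): convert the four doubles to 32-bit integers with _mm256_cvtpd_epi32, then pack down to bytes with the saturating SSE2 pack instructions.

#include <immintrin.h>
#include <cstdint>

// Sketch: 4 doubles -> 4 unsigned chars (saturated), returned in the low
// 32 bits of the result. cvtpd narrows to int32, two packs narrow to uint8.
static inline uint32_t doubles_to_chars(__m256d in) {
    __m128i i32 = _mm256_cvtpd_epi32(in);                       // 4 x int32
    __m128i i16 = _mm_packs_epi32(i32, _mm_setzero_si128());    // -> int16, saturated
    __m128i u8  = _mm_packus_epi16(i16, _mm_setzero_si128());   // -> uint8, saturated
    return (uint32_t)_mm_cvtsi128_si32(u8);                     // 4 packed bytes
}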

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

最后都变了- submitted on 2019-12-06 12:26:42
I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition, with OpenMP alongside it. However, it crashes with a segmentation fault at the call to _mm256_store_ps(). I have tried OpenMP constructs such as atomic and critical, in case this function is not atomic in nature and multiple cores attempt to execute it at the same time, but that did not help.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <immintrin.h>
#include <omp.h>
#define N 64
__m256 multiply_and_add_intel(__m256 a, __m256 b, _
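The usual culprit for a crash at _mm256_store_ps is a destination pointer that is not 32-byte aligned, rather than any atomicity issue. A minimal sketch of the aligned-store pattern (the buffer and loop are illustrative, not the asker's code):

#include <immintrin.h>

int main() {
    const int N = 64;
    // _mm256_store_ps requires a 32-byte-aligned destination.
    float* out = static_cast<float*>(_mm_malloc(N * sizeof(float), 32));
    __m256 v = _mm256_set1_ps(1.0f);
    for (int i = 0; i < N; i += 8)
        _mm256_store_ps(out + i, v);     // out + i stays 32-byte aligned
    // _mm256_storeu_ps would be the safe choice if alignment were unknown.
    _mm_free(out);
}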

Passing types containing SSE/AVX values

一笑奈何 submitted on 2019-12-06 06:06:14
Let's say I have the following:

struct A { __m256 a; };
struct B { __m256 a; float b; };

Which of the following is generally better (if any, and why) in a hot inner loop?

void f0(A a) { ... }
void f1(A& a) { ... } // and the pointer variation
void f2(B b) { ... }
void f3(B& b) { ... } // and the pointer variation

The answer is that it doesn't matter. According to this: http://msdn.microsoft.com/en-us/library/ms235286.aspx the calling convention states that 16-byte (and probably 32-byte) operands are always passed by reference. So even if you pass by value, the compiler will pass it by reference
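A minimal sketch of the two signatures side by side (illustrative only; the function bodies are placeholders I've made up):

#include <immintrin.h>

struct A { __m256 a; };

// On Windows x64 the ABI passes aggregates larger than 8 bytes via a hidden
// pointer anyway, so these typically compile to very similar code; the
// const-reference form just makes the indirection explicit in the source.
__m256 by_value(A x)      { return _mm256_add_ps(x.a, x.a); }
__m256 by_ref(const A& x) { return _mm256_add_ps(x.a, x.a); }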

Avoiding AVX-SSE (VEX) Transition Penalties

荒凉一梦 submitted on 2019-12-06 06:01:40
Question: Our 64-bit application has lots of code (inter alia, in standard libraries) that uses the xmm0-xmm7 registers in SSE mode. I would like to implement a fast memory copy using ymm registers. I cannot modify all the code that uses xmm registers to add the VEX prefix, and I also think that this would not be practical, since it would increase the size of the code and could make it run slower because the CPU would need to decode larger instructions. I just want to use two ymm registers (and possibly zmm - the
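A minimal sketch of the standard mitigation (my assumption about the intended fix, not code from the question): end any ymm-using routine with vzeroupper so the upper halves of the registers are marked clean before legacy-SSE code runs again, which avoids the transition penalty.

#include <immintrin.h>
#include <cstddef>

void copy_floats_avx(float* dst, const float* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i, _mm256_loadu_ps(src + i));
    for (; i < n; ++i)
        dst[i] = src[i];
    _mm256_zeroupper();   // vzeroupper: clear upper ymm state before SSE-only code
}

Compilers targeting AVX usually emit vzeroupper automatically at function exits, so the explicit intrinsic mainly matters for hand-written assembly or when that behavior is disabled.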

AVX convert 64 bit integer to 64 bit float

寵の児 submitted on 2019-12-06 05:51:26
Question: I would like to convert 4 packed 64-bit integers to 4 packed 64-bit floats using AVX. I've tried something like:

int64_t *ls = (int64_t *) _mm_malloc(256, 32);
ls[0] = a;
//...
ls[3] = d;
__m256i packed = _mm256_load_si256((__m256i const *)ls);

Which will display in the debugger:

(gdb) print packed
$4 = {1234, 5678, 9012, 3456}

Okay so far, but the only cast/conversion operation that I can find is _mm256_castsi256_pd, which doesn't get me what I want:

__m256d pd = _mm256_castsi256_pd(packed
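A full-range packed int64-to-double conversion only exists with AVX-512DQ (_mm256_cvtepi64_pd under AVX512VL). Without it, a widely used bit trick covers values of limited magnitude; a sketch (assumptions: inputs lie in [-2^51, 2^51) and AVX2 is available for the 64-bit integer add):

#include <immintrin.h>

// Adding the integer to the bit pattern of the double 2^52 + 2^51 places the
// value in the mantissa; subtracting that constant as a double recovers it.
static inline __m256d int64_to_double_limited(__m256i v) {
    const __m256d k = _mm256_set1_pd(0x0018000000000000);   // 2^52 + 2^51
    v = _mm256_add_epi64(v, _mm256_castpd_si256(k));
    return _mm256_sub_pd(_mm256_castsi256_pd(v), k);
}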

How to perform element-wise left shift with __m128i?

﹥>﹥吖頭↗ submitted on 2019-12-06 04:49:38
Question: The SSE shift instructions I have found, _mm_sll_epi32() and _mm_slli_epi32(), shift all the elements, but only by the same shift amount. Is there a way to apply different shifts to the different elements? Something like this:

__m128i a, __m128i b;
r0 := a0 << b0;
r1 := a1 << b1;
r2 := a2 << b2;
r3 := a3 << b3;

Answer 1: There exists the _mm_shl_epi32() intrinsic that does exactly that. http://msdn.microsoft.com/en-us/library/gg445138.aspx However, it
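Note that _mm_shl_epi32 is an XOP intrinsic (AMD-only). A hedged alternative for Intel hardware: AVX2 provides per-lane variable shifts directly.

#include <immintrin.h>

// AVX2: each 32-bit lane of a is shifted left by the corresponding lane of counts.
static inline __m128i shift_per_element(__m128i a, __m128i counts) {
    return _mm_sllv_epi32(a, counts);   // r[i] = a[i] << counts[i]
}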

Segmentation fault (core dumped) when using avx on an array allocated with new[]

孤街浪徒 submitted on 2019-12-06 04:39:30
When I run this code in Visual Studio 2015, it works correctly. But the same code fails in Code::Blocks with: Segmentation fault (core dumped). I also ran the code on Ubuntu and got the same error.

#include <iostream>
#include <immintrin.h>
struct INFO
{
    unsigned int id = 0;
    __m256i temp[8];
};
int main()
{
    std::cout << "Start AVX..." << std::endl;
    int _size = 100;
    INFO *info = new INFO[_size];
    for (int i = 0; i < _size; i++) {
        for (int k = 0; k < 8; k++) {
            info[i].temp[k] = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
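The likely cause (an assumption based on the truncated snippet) is the classic over-aligned new[] problem: INFO contains __m256i members, so it needs 32-byte alignment, but before C++17 operator new[] typically guarantees only 16 bytes, and the aligned stores the compiler generates for the assignments then fault. A minimal sketch of a fix using explicitly aligned allocation:

#include <immintrin.h>
#include <new>

struct INFO {
    unsigned int id = 0;
    __m256i temp[8];
};

int main() {
    const int size = 100;
    // Either compile as C++17 (new[] then honors alignof(INFO)), or allocate
    // aligned storage yourself and placement-construct the elements:
    INFO* info = static_cast<INFO*>(_mm_malloc(size * sizeof(INFO), 32));
    for (int i = 0; i < size; ++i)
        new (&info[i]) INFO();
    // ... use info ...
    _mm_free(info);
}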

How to speed up calculation of integral image?

烈酒焚心 submitted on 2019-12-06 03:22:21
Question: I often need to calculate an integral image. This is the simple algorithm:

void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride)
{
    memset(sum, 0, (width + 1) * sizeof(uint32_t));
    sum += sum_stride + 1;
    for (size_t row = 0; row < height; row++)
    {
        uint32_t row_sum = 0;
        sum[-1] = 0;
        for (size_t col = 0; col < width; col++)
        {
            row_sum += src[col];
            sum[col] = row_sum + sum[col - sum_stride];
        }
        src += src_stride;
        sum += sum_stride
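The inner loop is a prefix sum, which is what makes straightforward vectorization awkward. A hedged sketch of the usual SSE building block (my illustration, not the accepted answer): compute a 4-lane inclusive prefix sum with two shifted adds, then combine it with the running row total and the row above.

#include <emmintrin.h>
#include <cstdint>

// Inclusive prefix sum of four 32-bit lanes: [a, a+b, a+b+c, a+b+c+d].
static inline __m128i prefix_sum_epi32(__m128i x) {
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));   // add the lane to its left (zero shifted in)
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));   // combine the pairwise partial sums
    return x;
}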

MinGW64 Is Incapable of 32 Byte Stack Alignment (Required for AVX on Windows x64), Easy Work Around or Switch Compilers?

て烟熏妆下的殇ゞ submitted on 2019-12-06 02:34:01
Question: I'm trying to work with AVX instructions on 64-bit Windows. I'm comfortable with the g++ compiler, so I've been using that; however, there is a big bug reported here, and only very rough solutions were presented here. Basically, a __m256 variable can't be aligned on the stack to work properly with AVX instructions: it needs 32-byte alignment. The solutions presented at the other Stack Overflow question I linked are really terrible, especially if you have performance in mind. A Python program that you would
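A partial, hedged workaround while staying with MinGW (my suggestion, not the asker's eventual solution): keep 32-byte data in explicitly aligned heap storage and stick to unaligned load/store intrinsics for anything whose alignment the compiler cannot guarantee.

#include <immintrin.h>
#include <cstddef>

void scale(float* data, std::size_t n, float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);            // no alignment assumption
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    for (; i < n; ++i)
        data[i] *= factor;
}

This does not fix compiler-generated 32-byte spills inside more complicated functions, which is the core of the bug, so switching compilers (as the title suggests) remains the robust option.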

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

泪湿孤枕 submitted on 2019-12-06 02:09:45
I would like to know if the following is possible in any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential group of 7 bits, starting from the LSB, is shuffle-aligned to a byte, with a left padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result is only the significant bytes, thus 0-9 bytes in size, which is converted to a byte array.

In:  0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba
Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv
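A hedged sketch of the bit-spreading step using BMI2 rather than an SSE/AVX shuffle (my suggestion, not from the question): _pdep_u64 deposits each consecutive 7-bit group into the low 7 bits of a byte. Handling the 9th byte of a full 63-bit value and setting the leading-1 continuation bits would be separate steps.

#include <immintrin.h>
#include <cstdint>

// Spreads the low 56 bits of q into 8 bytes of 7 significant bits each.
// Requires BMI2 (e.g. compile with -mbmi2).
static inline uint64_t spread_7bit_groups(uint64_t q) {
    return _pdep_u64(q, 0x7f7f7f7f7f7f7f7fULL);
}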