intrinsics

Using STL vector with SIMD intrinsic data type

China☆狼群 posted on 2020-01-13 03:14:13
Question: As the title says, I am trying to use an STL vector with a SIMD intrinsic data type. I know it is not good practice because of the potential load/store overhead, but I ran into a rather strange fault. Here is the code:

    #include "immintrin.h"
    #include <vector>
    #include <stdio.h>

    #define VL 8

    int main () {
        std::vector<__m256> vec_1(10);
        std::vector<__m256> vec_2(10);
        float * tmp_1 = new float[VL];
        printf("vec_1[0]:\n");
        _mm256_storeu_ps(tmp_1, vec_1[0]); // seems to go as expected
        for (int i = 0;
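One alternative that is often suggested for this kind of code is to keep plain float values in the vector and move them through __m256 registers only inside the loop, using unaligned loads and stores. The sketch below shows that pattern; it is not a diagnosis of the fault described in the question. The function name add_one is hypothetical, and the AVX path is only compiled when the compiler defines __AVX__ (for example with -mavx); otherwise the scalar loop does all the work.

```cpp
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>  // _mm256_loadu_ps, _mm256_add_ps, _mm256_storeu_ps
#endif

// Hypothetical helper: adds 1.0f to every element of a float vector.
// When AVX is enabled at compile time, 8 floats are processed per iteration.
void add_one(std::vector<float>& data) {
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 ones = _mm256_set1_ps(1.0f);
    for (; i + 8 <= data.size(); i += 8) {
        // Unaligned load/store: std::vector<float> only guarantees alignment for float.
        const __m256 v = _mm256_loadu_ps(data.data() + i);
        _mm256_storeu_ps(data.data() + i, _mm256_add_ps(v, ones));
    }
#endif
    // Scalar tail (or the whole range when AVX is not enabled).
    for (; i < data.size(); ++i) {
        data[i] += 1.0f;
    }
}
```

The container then never stores __m256 objects, so its allocator only has to satisfy the alignment of float.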

Constexpr and SSE intrinsics

半世苍凉 posted on 2020-01-12 07:20:31
Question: Most C++ compilers support SIMD (SSE/AVX) instructions through intrinsics such as _mm_cmpeq_epi32. My problem is that this function is not marked constexpr, although "semantically" there is no reason for it not to be, since it is a pure function. Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr? Obviously I would like the function to use the proper asm at runtime; I know I can reimplement any SIMD function with slow
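One way to approach the constexpr wrapper asked about here is to branch on whether the call is being constant-evaluated. The following is a minimal sketch under stated assumptions: the struct I32x4 and the functions cmpeq_epi32 and cmpeq_epi32_runtime are hypothetical names, and __builtin_is_constant_evaluated() is the compiler builtin behind C++20 std::is_constant_evaluated(), assumed here to be available (GCC 9 or later, Clang 9 or later). The SSE2 path is compiled only when __SSE2__ is defined.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_cmpeq_epi32, _mm_storeu_si128
#endif

// Hypothetical literal type holding four 32-bit lanes, usable in constant expressions.
struct I32x4 {
    std::int32_t lane[4];
};

#if defined(__SSE2__)
// Runtime path: uses the real intrinsic.
inline I32x4 cmpeq_epi32_runtime(const I32x4& a, const I32x4& b) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.lane));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.lane));
    I32x4 out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.lane), _mm_cmpeq_epi32(va, vb));
    return out;
}
#endif

// Lane-wise equality: -1 (all bits set) where lanes match, 0 otherwise,
// mirroring what _mm_cmpeq_epi32 computes.
constexpr I32x4 cmpeq_epi32(const I32x4& a, const I32x4& b) {
#if defined(__SSE2__)
    if (!__builtin_is_constant_evaluated()) {
        return cmpeq_epi32_runtime(a, b);
    }
#endif
    // Scalar path: used during constant evaluation and on non-SSE2 targets.
    I32x4 out{};
    for (int i = 0; i < 4; ++i) {
        out.lane[i] = (a.lane[i] == b.lane[i]) ? -1 : 0;
    }
    return out;
}
```

The wrapper works on a plain struct rather than on __m128i directly because support for vector types inside constant expressions varies between compilers.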

clflush to invalidate cache line via C function

二次信任 posted on 2020-01-08 14:38:06
Question: I am trying to use clflush to manually evict a cache line in order to determine cache and line sizes. I did not find any guide on how to use that instruction; all I see is code that uses higher-level functions for that purpose. There is a kernel function void clflush_cache_range(void *vaddr, unsigned int size), but I still don't know what to include in my code or how to use it. I don't know what size means in that function. More than that, how can I be sure that the line is
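In user space, the clflush instruction is exposed through the _mm_clflush intrinsic from <emmintrin.h>. The sketch below walks a byte range one cache line at a time, in the spirit of the kernel's clflush_cache_range. The function name flush_range is hypothetical, and the default line size of 64 bytes is an assumption that holds on many x86 CPUs but should be queried on a real system. On targets without SSE2 the flushes compile away and only the line count is computed.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence
#endif

// Hypothetical helper: flushes every cache line overlapping [addr, addr + size)
// and returns the number of lines visited.
// line_size must be a power of two; 64 is an assumed default, not a queried value.
std::size_t flush_range(const void* addr, std::size_t size, std::size_t line_size = 64) {
    if (size == 0) {
        return 0;
    }
    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(line_size) - 1);
    const std::uintptr_t first = reinterpret_cast<std::uintptr_t>(addr) & mask;
    const std::uintptr_t last = reinterpret_cast<std::uintptr_t>(addr) + size - 1;
    std::size_t lines = 0;
    for (std::uintptr_t p = first; p <= last; p += line_size) {
#if defined(__SSE2__)
        _mm_clflush(reinterpret_cast<const void*>(p));  // write back and invalidate this line
#endif
        ++lines;
    }
#if defined(__SSE2__)
    _mm_mfence();  // order the flushes before later loads and stores
#endif
    return lines;
}
```

Flushing writes modified data back to memory before invalidating the line, so the contents of the buffer are unchanged afterwards; only the time of the next access differs.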

Matrix Multiplication of size 100*100 using SSE Intrinsics

狂风中的少年 posted on 2020-01-06 06:51:45
Question:

    int MAX_DIM = 100;
    float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
    float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
    float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
    /*
     * I fill these arrays with some values
     */
    for(int i=0;i<MAX_DIM;i+=1){
        for(int j=0;j<MAX_DIM;j+=4){
            for(int k=0;k<MAX_DIM;k+=4){
                __m128 result = _mm_load_ps(&d[i][j]);
                __m128 a_line = _mm_load_ps(&a[i][k]);
                __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
                __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
                __m128 b_line2
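For comparison, a common way to vectorize a row-major matrix product with SSE is to broadcast one element a[i][k] and multiply it by four consecutive elements of row k of b, accumulating into row i of d. The sketch below shows that scheme; it is not a reconstruction of the truncated loop above. The function name matmul_add and the flat row-major layout are assumptions made for illustration. Unaligned loads are used so the routine does not depend on the alignment of the arrays, and a scalar loop handles any columns left over when n is not a multiple of 4.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps
#endif

// Hypothetical helper: computes d += a * b for n x n row-major matrices
// stored in flat arrays (element (r, c) lives at index r * n + c).
void matmul_add(const float* a, const float* b, float* d, int n) {
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const float a_ik = a[i * n + k];
            int j = 0;
#if defined(__SSE__)
            const __m128 va = _mm_set1_ps(a_ik);  // broadcast a[i][k] to all 4 lanes
            for (; j + 4 <= n; j += 4) {
                __m128 acc = _mm_loadu_ps(&d[i * n + j]);
                const __m128 vb = _mm_loadu_ps(&b[k * n + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
                _mm_storeu_ps(&d[i * n + j], acc);
            }
#endif
            // Scalar tail (or the whole row when SSE is not enabled).
            for (; j < n; ++j) {
                d[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
}
```

With this loop order every vector load reads four contiguous floats from a single row of b, so no column gather or transposition is needed.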

Equivalents to gcc/clang's march=native in other compilers?

回眸只為那壹抹淺笑 posted on 2020-01-05 23:00:04
Question: I'd like to know whether compilers other than gcc and clang provide something like an -march=native option and, if so, what that option is. I already understand from another question (Automatically building for best available platform in visual c++ (equivalent to gcc's -march=native)) that Microsoft's compilers do not have that option (unless it's implied in the option that activates the SSE2 instruction set, up to and excluding AVX and higher at least). The use case is simple:
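Whatever flag a given compiler offers, its effect can be observed through the predefined feature macros: an option such as -march=native (or an explicit -mavx2, or /arch:AVX2 with Microsoft's compiler) changes which of __SSE2__, __AVX__, __AVX2__ and __AVX512F__ are defined. The sketch below reports the widest level enabled at compile time. The function name compiled_simd_level is hypothetical, and the list of macros is deliberately short rather than exhaustive.

```cpp
// Hypothetical helper: names the widest x86 SIMD extension the compiler was
// allowed to use for this translation unit, based on predefined macros only.
const char* compiled_simd_level() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    // _M_X64 and _M_IX86_FP are the macros Microsoft's compiler uses for this level.
    return "SSE2";
#else
    return "baseline";
#endif
}
```

This reflects what the compiler was told to target, not what the CPU running the binary actually supports; that would need a runtime CPUID check.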

Equivalents to gcc/clang's march=native in other compilers?

时光毁灭记忆、已成空白 posted on 2020-01-05 22:54:19
Question: I'd like to know whether compilers other than gcc and clang provide something like an -march=native option and, if so, what that option is. I already understand from another question (Automatically building for best available platform in visual c++ (equivalent to gcc's -march=native)) that Microsoft's compilers do not have that option (unless it's implied in the option that activates the SSE2 instruction set, up to and excluding AVX and higher at least). The use case is simple:

Stack usage with MMX intrinsics and Microsoft C++

江枫思渺然 posted on 2020-01-05 07:09:32
Question: I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use the 8 MMX registers to accumulate 16 independent sums. Can anybody

Stack usage with MMX intrinsics and Microsoft C++

六月ゝ 毕业季﹏ posted on 2020-01-05 07:09:02
Question: I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use the 8 MMX registers to accumulate 16 independent sums. Can anybody

Header for _blsr_u64 with Sun supplied GCC on Solaris 11?

柔情痞子 posted on 2020-01-04 05:21:14
Question: We've got some code that runs on multiple platforms. The code uses BMI/BMI2 intrinsics when available, such as on a 5th-gen Core i7. The GCC supplied by Sun on Solaris 11.3 defines __BMI__ and __BMI2__, but it's having trouble locating the BMI/BMI2 intrinsics:

    $ cat test.cxx
    #include <x86intrin.h>
    int main(int argc, char* argv[])
    {
        unsigned long long t = argc;
    #if defined(__BMI__) || defined(__BMI2__)
        t = _blsr_u64(t);
    #endif
        return int(t);
    }
    $ /bin/g++ -march=native test.cxx -o test.exe
    test.cxx: In
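Independently of which header declares the intrinsic, the operation behind _blsr_u64 (reset the lowest set bit) can be written portably as x & (x - 1). The sketch below is such a fallback, not a fix for the header problem in the question; the function name clear_lowest_set_bit is hypothetical. Optimizing compilers commonly recognize this expression and may emit the BLSR instruction when BMI is enabled, though that is not guaranteed.

```cpp
#include <cstdint>

// Hypothetical portable fallback for _blsr_u64: clears the lowest set bit of x.
// For x == 0 the result is 0, matching the BLSR instruction.
constexpr std::uint64_t clear_lowest_set_bit(std::uint64_t x) {
    return x & (x - 1);
}
```

Because the function is constexpr and uses no intrinsics, it also compiles on platforms where the BMI headers are missing or incomplete.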