intrinsics | 易学教程

Using STL vector with SIMD intrinsic data type

Question: As the title says, I am trying to use an STL vector with a SIMD intrinsic data type. I know it is not good practice because of the potential load/store overhead, but I ran into a rather strange fault. Here is the code: #include "immintrin.h" #include <vector> #include <stdio.h> #define VL 8 int main () { std::vector<__m256> vec_1(10); std::vector<__m256> vec_2(10); float * tmp_1 = new float[VL]; printf("vec_1[0]:\n"); _mm256_storeu_ps(tmp_1, vec_1[0]); // seems to go as expected for (int i = 0;
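The excerpt is cut off before the fault appears, so its cause cannot be read from it. One thing worth ruling out with std::vector<__m256> is alignment: before C++17, std::allocator is not required to honor the 32-byte alignment of __m256. A minimal sketch of an aligned allocator follows. AlignedAllocator is a hypothetical name, not part of the question, and _mm_malloc/_mm_free are assumed to be available through <xmmintrin.h>.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#include <xmmintrin.h>  // _mm_malloc / _mm_free (assumed available via this header)

// Hypothetical helper: an allocator that hands out storage aligned to Align bytes.
template <class T, std::size_t Align>
struct AlignedAllocator {
    using value_type = T;

    // rebind must be spelled out because of the non-type template parameter.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Align>; };

    AlignedAllocator() = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;
        void* p = _mm_malloc(n * sizeof(T), Align);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return true;
}

template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept {
    return false;
}
```

With AVX enabled at compile time, the vector from the question could then be declared as std::vector<__m256, AlignedAllocator<__m256, 32>> vec_1(10);.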

Constexpr and SSE intrinsics

Question: Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like _mm_cmpeq_epi32. My problem is that this function is not marked constexpr, although "semantically" there is no reason for it not to be, since it is a pure function. Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr? Obviously I would like the function to use the proper asm at runtime. I know I can reimplement any SIMD function with slow
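One possible approach, sketched here under assumptions, is to branch on whether the call is being constant-evaluated: a scalar loop at compile time, the real intrinsic at runtime. The sketch uses __builtin_is_constant_evaluated(), a GCC/Clang builtin (C++20 spells it std::is_constant_evaluated()). I32x4, cmpeq_runtime and cmpeq_epi32 are hypothetical names, and a plain struct stands in for __m128i so that the compile-time path needs no vector types.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_cmpeq_epi32, _mm_loadu_si128, _mm_storeu_si128

// Hypothetical wrapper type: four 32-bit lanes in a literal type.
struct I32x4 {
    std::int32_t v[4];
};

// Runtime path: uses the real intrinsic.
inline I32x4 cmpeq_runtime(const I32x4& a, const I32x4& b) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.v));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.v));
    I32x4 out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.v), _mm_cmpeq_epi32(va, vb));
    return out;
}

// constexpr front end: scalar emulation during constant evaluation,
// intrinsic otherwise. Equal lanes become all ones (-1), others 0.
constexpr I32x4 cmpeq_epi32(const I32x4& a, const I32x4& b) {
    if (__builtin_is_constant_evaluated()) {
        I32x4 out{};
        for (int i = 0; i < 4; ++i) {
            out.v[i] = (a.v[i] == b.v[i]) ? -1 : 0;
        }
        return out;
    }
    return cmpeq_runtime(a, b);
}
```

The load and store through a struct add some overhead compared with keeping values in __m128i, so this is a trade-off rather than a drop-in replacement.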

clflush to invalidate cache line via C function

Question: I am trying to use clflush to manually evict a cache line in order to determine cache and line sizes. I didn't find any guide on how to use that instruction. All I see is code that uses higher-level functions for that purpose. There is a kernel function void clflush_cache_range(void *vaddr, unsigned int size), but I still don't know what to include in my code or how to use it. I don't know what size means in that function. More than that, how can I be sure that the line is
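From user space, clflush is reachable through the _mm_clflush intrinsic, so the kernel-only clflush_cache_range is not needed. The sketch below assumes GCC or Clang on x86-64 (MSVC provides __rdtsc through <intrin.h> instead). flush_line and timed_read are hypothetical helper names. The cycle counts are noisy and only meaningful when compared over many repetitions.

```cpp
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_mfence, _mm_lfence (SSE2)
#include <x86intrin.h>  // __rdtsc on GCC/Clang (assumption: not MSVC)

// Evict the cache line containing p from the cache hierarchy.
inline void flush_line(const void* p) {
    _mm_clflush(p);
    _mm_mfence();  // wait until the flush has completed before continuing
}

// Return a rough cycle count for one read of *p.
inline std::uint64_t timed_read(const volatile int* p) {
    _mm_lfence();
    const std::uint64_t start = __rdtsc();
    _mm_lfence();
    const int value = *p;  // the access being timed
    (void)value;
    _mm_lfence();
    const std::uint64_t end = __rdtsc();
    return end - start;
}
```

Calling timed_read right after flush_line should, on average, take noticeably longer than calling it on a line that was just touched, which is one way to check that the eviction happened.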

Matrix Multiplication of size 100*100 using SSE Intrinsics

Question: int MAX_DIM = 100; float a[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16))); float b[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16))); float d[MAX_DIM][MAX_DIM]__attribute__ ((aligned(16))); /* * I fill these arrays with some values */ for(int i=0;i<MAX_DIM;i+=1){ for(int j=0;j<MAX_DIM;j+=4){ for(int k=0;k<MAX_DIM;k+=4){ __m128 result = _mm_load_ps(&d[i][j]); __m128 a_line = _mm_load_ps(&a[i][k]); __m128 b_line0 = _mm_load_ps(&b[k][j+0]); __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]); __m128 b_line2
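The asker's loop is truncated, so the following is not a repair of it but a different, commonly used formulation for comparison: broadcast one element a[i][k] and multiply it with four consecutive elements of row k of b. It is a sketch that assumes row-major square matrices whose dimension n is a multiple of 4 (100 qualifies). matmul_sse is a hypothetical name, and unaligned loads are used so that no alignment assumption is needed.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps

// d = a * b for row-major n x n matrices; n must be a multiple of 4.
void matmul_sse(const float* a, const float* b, float* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; j += 4) {
            __m128 acc = _mm_setzero_ps();  // accumulates d[i][j..j+3]
            for (std::size_t k = 0; k < n; ++k) {
                const __m128 a_ik = _mm_set1_ps(a[i * n + k]);      // broadcast a[i][k]
                const __m128 b_row = _mm_loadu_ps(&b[k * n + j]);   // b[k][j..j+3]
                acc = _mm_add_ps(acc, _mm_mul_ps(a_ik, b_row));
            }
            _mm_storeu_ps(&d[i * n + j], acc);
        }
    }
}
```

For the arrays in the question the call would be matmul_sse(&a[0][0], &b[0][0], &d[0][0], 100).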

Equivalents to gcc/clang's march=native in other compilers?

Question: I'd like to know if there are compilers other than gcc and clang that provide something like an -march=native option, and if so, what that option is. I already understand from another question (Automatically building for best available platform in visual c++ (equivalent to gcc's -march=native)) that Microsoft's compilers do not have that option (unless it's implied in the option that activates the SSE2 instruction set, up to and excluding AVX and higher at least). The use case is simple:

Count positives from float vector using _mm_cmpgt_pd

Question: I'm trying to write a program using intrinsics that counts the elements greater than 0 in a float vector. Thank you all for your time. Source: https://stackoverflow.com/questions/47461547/count-positives-from-float-vector-using-mm-cmpgt-pd
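The asker's code is not included in the excerpt. As a rough sketch of one way to do this for float data, the single-precision compare _mm_cmpgt_ps can be combined with _mm_movemask_ps; the _mm_cmpgt_pd named in the title is the double-precision counterpart and handles two lanes per register instead of four. count_positive is a hypothetical name.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_cmpgt_ps, _mm_movemask_ps, _mm_loadu_ps

// Count the elements of data[0..n) that are strictly greater than zero.
std::size_t count_positive(const float* data, std::size_t n) {
    const __m128 zero = _mm_setzero_ps();
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 v = _mm_loadu_ps(data + i);
        // One bit per lane: bit k is set when lane k is > 0.
        const int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, zero));
        count += (mask & 1) + ((mask >> 1) & 1) + ((mask >> 2) & 1) + ((mask >> 3) & 1);
    }
    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        if (data[i] > 0.0f) ++count;
    }
    return count;
}
```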

Stack usage with MMX intrinsics and Microsoft C++

Question: I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics, but I am afraid that I will suffer a performance penalty because one cannot explicitly instruct the compiler to use the 8 MMX registers to accumulate 16 independent sums. Can anybody
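With intrinsics, the usual pattern is to keep each accumulator in its own local __m64 variable and leave register allocation to the compiler. The sketch below shows the idea with two accumulators (four sums) rather than the eight from the question. It assumes a compiler and target where <mmintrin.h> is available (for example GCC or Clang on x86), and column_sums4 and the data layout are hypothetical. Whether the accumulators stay in registers or spill to the stack can only be confirmed by inspecting the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <mmintrin.h>  // MMX: _mm_add_pi32, _mm_set_pi32, _mm_srli_si64, _mm_empty

// data holds `rows` groups of four int32 values.
// sums[j] receives the sum of element j over all groups.
void column_sums4(const std::int32_t* data, std::size_t rows, std::int32_t sums[4]) {
    __m64 acc0 = _mm_setzero_si64();  // lanes for columns 0 and 1
    __m64 acc1 = _mm_setzero_si64();  // lanes for columns 2 and 3
    for (std::size_t r = 0; r < rows; ++r) {
        const std::int32_t* p = data + 4 * r;
        // _mm_set_pi32(high, low): the second argument goes to the low lane.
        acc0 = _mm_add_pi32(acc0, _mm_set_pi32(p[1], p[0]));
        acc1 = _mm_add_pi32(acc1, _mm_set_pi32(p[3], p[2]));
    }
    sums[0] = _mm_cvtsi64_si32(acc0);
    sums[1] = _mm_cvtsi64_si32(_mm_srli_si64(acc0, 32));
    sums[2] = _mm_cvtsi64_si32(acc1);
    sums[3] = _mm_cvtsi64_si32(_mm_srli_si64(acc1, 32));
    _mm_empty();  // leave MMX state so later x87 floating-point code works
}
```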

Header for _blsr_u64 with Sun supplied GCC on Solaris 11?

Question: We've got some code that runs on multiple platforms. The code uses BMI/BMI2 intrinsics when available, for example on a 5th-generation Core i7. GCC supplied by Sun on Solaris 11.3 is defining __BMI__ and __BMI2__, but it's having trouble locating the BMI/BMI2 intrinsics: $ cat test.cxx #include <x86intrin.h> int main(int argc, char* argv[]) { unsigned long long t = argc; #if defined(__BMI__) || defined(__BMI2__) t = _blsr_u64(t); #endif return int(t); } $ /bin/g++ -march=native test.cxx -o test.exe test.cxx: In
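The error output is cut off, so the header question itself stays open here. As a stopgap that does not depend on any header, the operation BLSR performs (clear the lowest set bit) can be written in plain C++. This is only a sketch, and blsr_u64_portable is a hypothetical name. Compilers may or may not turn the expression into a single BLSR instruction when BMI is enabled.

```cpp
#include <cstdint>

// Clear the lowest set bit of x, the same result as the BLSR instruction.
// For x == 0 the result is 0.
constexpr std::uint64_t blsr_u64_portable(std::uint64_t x) {
    return x & (x - 1);
}
```

In the test program from the question, t = blsr_u64_portable(t); could replace the guarded call to _blsr_u64.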