simd | 易学教程

SSE slower than FPU?

Question: I have a large piece of code, part of whose body contains this piece of code: result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1); which I have vectorized as follows (everything is already a float): __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx), _mm_set_ps(ny, nx, m_Ly, m_Lx)); __declspec(align(16)) int asInt[4] = { _mm_extract_ps(r,0), _mm_extract_ps(r,1), _mm_extract_ps(r,2), _mm_extract_ps(r,3) }; float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt); result = (res
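
The excerpt stops before the vectorized expression is complete, so the following is only a sketch of one way the same computation could be written, not the asker's code. The function names `shade_scalar` and `shade_sse` and the plain parameters `Lx`, `Ly`, `Lz` (standing in for `m_Lx`, `m_Ly`, `m_Lz`) are assumptions made for illustration. It stores all four lanes with a single `_mm_storeu_ps` instead of four `_mm_extract_ps` calls, and falls back to the scalar version when SSE is not enabled at compile time.

```cpp
#include <cmath>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Scalar reference for the expression quoted in the question.
float shade_scalar(float nx, float ny, float Lx, float Ly, float Lz) {
    return (nx * Lx + ny * Ly + Lz) / std::sqrt(nx * nx + ny * ny + 1.0f);
}

// Hypothetical SSE variant: one packed multiply yields all four products.
float shade_sse(float nx, float ny, float Lx, float Ly, float Lz) {
#if defined(__SSE__)
    // _mm_set_ps lists lanes from highest to lowest, so after the multiply
    // lane 0 = nx*Lx, lane 1 = ny*Ly, lane 2 = nx*nx, lane 3 = ny*ny.
    const __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                                _mm_set_ps(ny, nx, Ly, Lx));
    float res[4];
    _mm_storeu_ps(res, r);  // unaligned store of all four lanes
    return (res[0] + res[1] + Lz) / std::sqrt(res[2] + res[3] + 1.0f);
#else
    // No SSE at compile time: use the scalar path.
    return shade_scalar(nx, ny, Lx, Ly, Lz);
#endif
}
```

With only four multiplies per result, packing the inputs and unpacking the lanes may well cost more than the packed multiply saves, which would be consistent with the slowdown the title describes. Measuring both variants is the only reliable check.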

How to check if compiled code uses SSE and AVX instructions?

Question: I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most of my code looks something like the following: for(int i = 0;i<size;i++){ a[i] = b[i] * c[i]; } Is there any way I can tell if my code (after compilation) uses SSE and AVX instructions? I think I could look at the assembly to see, but I don't know
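
As a minimal sketch of what could be inspected, the loop from the question can be isolated in a small function. The name `multiply` and the use of `std::vector` are assumptions for illustration. Compiling a file containing it with `g++ -O3 -march=native -S` writes the generated assembly to a `.s` file. In that output, packed instructions such as `mulps` on `xmm` registers indicate SSE, and `vmulps` on `ymm` registers indicates AVX.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise product, the same loop shape as in the question.
// A compiler at -O3 may auto-vectorize the loop body.
std::vector<float> multiply(const std::vector<float>& b,
                            const std::vector<float>& c) {
    assert(b.size() == c.size());
    std::vector<float> a(b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] * c[i];
    }
    return a;
}
```

Scalar SSE instructions such as `mulss` also use `xmm` registers, so the register name alone does not show that a loop was vectorized. The `ps` suffix (packed single) is the more telling sign.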

does rewriting memcpy/memcmp/… with SIMD instructions make sense

Question: Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't gcc generate SIMD instructions for these library functions by default? Also, are there any other functions that could be improved by SIMD? Answer 1: Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library/compiler intrinsics included optimized versions, but that doesn't seem to be pervasive. I have a custom SIMD memchr which is a hell

Checking if SSE is supported at runtime [duplicate]

Question: This question already has answers here: How to check if a CPU supports the SSE3 instruction set? (5 answers); cpu dispatcher for visual studio for AVX and SSE (3 answers). Closed 4 years ago. I would like to check if SSE4 or AVX is supported at runtime, so that my program may take advantage of processor-specific instructions without creating a binary for each processor. If I could determine it at runtime, I could use an interface and switch between different instruction sets. Answer 1: GCC has a
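
The answer excerpt is cut off after "GCC has a", so the following is not a reconstruction of it. It is a minimal sketch of runtime detection using the GCC builtin `__builtin_cpu_supports` (also available in Clang on x86). The `SimdLevel` enum and the functions `detect_simd_level` and `simd_level_name` are hypothetical names introduced here. On other compilers or architectures the sketch simply reports the scalar level.

```cpp
// Hypothetical dispatch levels, ordered from least to most capable.
enum class SimdLevel { Scalar, SSE41, AVX };

// Queries the running CPU once and returns the best supported level.
SimdLevel detect_simd_level() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // harmless if the CPU model was already detected
    if (__builtin_cpu_supports("avx")) return SimdLevel::AVX;
    if (__builtin_cpu_supports("sse4.1")) return SimdLevel::SSE41;
#endif
    // Unknown compiler or non-x86 target: assume nothing.
    return SimdLevel::Scalar;
}

// Human-readable name, useful for logging which code path was chosen.
const char* simd_level_name(SimdLevel level) {
    switch (level) {
        case SimdLevel::AVX:   return "avx";
        case SimdLevel::SSE41: return "sse4.1";
        default:               return "scalar";
    }
}
```

A program could call `detect_simd_level()` once at startup and select an implementation behind a common interface, as the question suggests. Each implementation would then need to be compiled with the matching target flags.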

c++ SSE SIMD framework [closed]

Question: Closed as off-topic 5 years ago. Does anyone know an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in its Integrated Performance Primitives library, but I can't use that because of the copyrights all over the place. EDIT: I already know the intrinsics provided by the compilers. What I need is a convenient

Ineffective remainder loop in my code

Question: I have this function: bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res) { bool ret = false; // input size (-1 for the safe bilinear interpolation) const int width = im.cols-1; const int height = im.rows-1; // output size const int halfWidth = res.cols >> 1; const int halfHeight = res.rows >> 1; float *out = res.ptr<float>(0); const float *imptr = im.ptr<float>(0); for (int j=-halfHeight; j<=halfHeight; ++j) { const float rx = ofsx +

why is strchr twice as fast as my simd code

Question: I am learning SIMD and was curious to see whether it was possible to beat strchr at finding a character. It appears that strchr uses the same intrinsics, but I assume that it checks for a null, whereas I know the character is in the array and plan on avoiding a null check. My code is: size_t N = 1e9; bool found = false; //Not really used ... size_t char_index1 = 0; size_t char_index2 = 0; char * str = malloc(N); memset(str,'a',N); __m256i char_match; __m256i str_simd; __m256i result; __m256i*
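
The asker's loop is truncated, so the following is a separate minimal sketch of the compare-and-movemask pattern that an AVX2 byte search commonly uses, not the asker's code. The function name `find_byte` and its return convention (the length `n` when nothing is found) are assumptions. The AVX2 path is compiled only when the compiler is GCC or Clang and AVX2 is enabled (for example with `-mavx2`). Otherwise only the scalar loop runs.

```cpp
#include <cstddef>
#if defined(__AVX2__) && defined(__GNUC__)
#include <immintrin.h>
#endif

// Returns the index of the first byte equal to c in buf[0..n), or n if absent.
std::size_t find_byte(const char* buf, std::size_t n, char c) {
    std::size_t i = 0;
#if defined(__AVX2__) && defined(__GNUC__)
    const __m256i needle = _mm256_set1_epi8(c);  // c broadcast to 32 lanes
    for (; i + 32 <= n; i += 32) {
        const __m256i chunk = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(buf + i));  // unaligned load
        // One bit per byte: set where the byte equals c.
        const unsigned mask = static_cast<unsigned>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle)));
        if (mask != 0) {
            // Lowest set bit is the first match within this chunk.
            return i + static_cast<std::size_t>(__builtin_ctz(mask));
        }
    }
#endif
    // Scalar tail (and the whole search when AVX2 is unavailable).
    for (; i < n; ++i) {
        if (buf[i] == c) return i;
    }
    return n;
}
```

Because the length is passed explicitly, the sketch never tests for a terminating null, which matches the asker's stated plan. Whether it outperforms a library `strchr` depends on the library, the buffer size, and memory bandwidth.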

C++ SSE filter implementation

Question: I tried to use SSE to operate on 4 pixels at a time. I have a problem loading the image data into an __m128. My image data is a char buffer. Let's say my image is 1024 x 1024. My filter is 16x16. __m128 IMG_VALUES, FIL_VALUES, NEW_VALUES; //ok: IMG_VALUES=_mm_load_ps(&pInput[0]); //hang below: IMG_VALUES=_mm_load_ps(&pInput[1]); I don't know how to handle index 1,2,3... Thanks. Answer 1: If you really need to do this with floating point rather than integer/fixed point then you will need to load your 8-bit data,
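
The answer is cut off at this point, so the following is only one possible sketch of widening 8-bit pixels to floats, not the answerer's text. The function name `load4_u8_as_float` is hypothetical, and the pixels are assumed to be unsigned. Two facts matter for the question's code: `_mm_load_ps` requires a 16-byte aligned address, which `&pInput[1]` generally is not, and it reads the bytes as packed floats rather than converting them. The sketch uses SSE2 when available and a scalar loop otherwise.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Converts four consecutive unsigned 8-bit pixels at src into four floats
// written to dst. Neither pointer needs any particular alignment.
void load4_u8_as_float(const std::uint8_t* src, float* dst) {
#if defined(__SSE2__)
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);  // alignment-safe 4-byte read
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(packed);     // 4 bytes in the low lane
    v = _mm_unpacklo_epi8(v, zero);            // zero-extend 8 -> 16 bits
    v = _mm_unpacklo_epi16(v, zero);           // zero-extend 16 -> 32 bits
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(v));    // int32 -> float, unaligned store
#else
    // Portable fallback with the same result.
    for (int i = 0; i < 4; ++i) {
        dst[i] = static_cast<float>(src[i]);
    }
#endif
}
```

Calling it with `pInput + 1`, `pInput + 2`, and so on works for any byte offset, since the four pixels are read through `memcpy` rather than an aligned vector load.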

ternary operator for clang's extended vectors

Question: I've tried playing with clang's extended vectors. The ternary operator is supposed to work, but it is not working for me. Example: int main() { using int4 = int __attribute__((ext_vector_type(4))); int4 a{0, 1, 3, 4}; int4 b{2, 1, 4, 5}; auto const r(a - b ? a : b); return 0; } Please provide examples of how I might make it work, like it works under OpenCL. I am using clang-3.4.2. Error: t.cpp:8:16: error: value of type 'int __attribute__((ext_vector_type(4)))' is not contextually