SSE

SSE (SIMD): multiply vector by scalar

£可爱£侵袭症+ Submitted on 2019-12-03 15:16:48
Question: A common operation I do in my program is scaling a vector by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar into every position of a vector (e.g. _mm_set_ps(2,2,2,2)) and then multiplying? This is what I do now:

    __m128 _scalar = _mm_set_ps(s,s,s,s);
    __m128 _result = _mm_mul_ps(_vector, _scalar);

I'm looking for something like:

    __m128 _result = _mm_scale_ps(_vector, s);

Answer 1: Depending on your compiler you may be
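
The broadcast-then-multiply pattern above is essentially what the hardware offers; a minimal sketch, assuming the goal is simply to avoid spelling out the scalar four times, is to use _mm_set1_ps (which compilers typically lower to a single shuffle or, with AVX, a vbroadcastss):

    #include <xmmintrin.h>

    /* Sketch: scale four packed floats by one scalar.
       _mm_set1_ps broadcasts s into all four lanes, so no separate
       "scale by scalar" instruction is needed. */
    static __m128 scale_ps(__m128 v, float s) {
        return _mm_mul_ps(v, _mm_set1_ps(s));
    }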

Bilinear filter with SSE4.1 intrinsics

南楼画角 Submitted on 2019-12-03 14:42:10
I am trying to figure out a reasonably fast bilinear filtering function, just for one filtered sample at a time for now, as an exercise in getting used to using intrinsics - up to SSE4.1 is fine. So far I have the following:

    inline __m128i DivideBy255_8xUint16(const __m128i value)
    {
        // Blinn's 16-bit divide-by-255 trick, applied across 8 packed 16-bit values
        const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128));
        const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8); // TODO: Should this be an arithmetic or logical shift, or does it matter?
        const __m128i partial = _mm_add_epi16(plus128,
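
For reference, a self-contained sketch of the usual completion of Blinn's x/255 ≈ (x + 128 + ((x + 128) >> 8)) >> 8 approximation; the logical shift _mm_srli_epi16 is the right choice here, because the intermediates are unsigned 16-bit values that can have bit 15 set. The function name matches the snippet above, but the body is an assumed completion, not the poster's code:

    #include <emmintrin.h>

    /* Approximate x/255 for 8 packed unsigned 16-bit values (x <= 255*255),
       using (x + 128 + ((x + 128) >> 8)) >> 8. Logical shifts are required
       because the values are unsigned. */
    static inline __m128i DivideBy255_8xUint16(const __m128i value) {
        const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128));
        const __m128i approx  = _mm_srli_epi16(plus128, 8);
        const __m128i partial = _mm_add_epi16(plus128, approx);
        return _mm_srli_epi16(partial, 8);
    }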

Where do SSE instructions outperform normal instructions?

大城市里の小女人 Submitted on 2019-12-03 14:33:35
Question: Where do x86-64 SSE (vector) instructions outperform the normal scalar instructions? What I'm seeing is that the frequent loads and stores required to execute SSE instructions nullify any gain we get from the vector calculation. So could someone give me an example of SSE code where it performs better than the normal code? It's maybe because I am passing each parameter separately, like this...

    __m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
    __m128i b =
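
Building each vector element by element with _mm_set_epi32 is usually where the gain disappears. A minimal sketch of the contrasting approach, assuming the integers live in contiguous arrays, loads four elements with a single instruction:

    #include <emmintrin.h>

    /* Sketch: add two int arrays 4 elements at a time.
       One unaligned load replaces four scalar loads plus shuffles,
       which is where SSE typically starts to pay off.
       Assumes n is a multiple of 4. */
    static void add_arrays(int *out, const int *pa, const int *pb, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128i a = _mm_loadu_si128((const __m128i *)(pa + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(pb + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(a, b));
        }
    }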

Constexpr and SSE intrinsics

陌路散爱 Submitted on 2019-12-03 12:56:16
Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like _mm_cmpeq_epi32. My problem with this is that this function is not marked constexpr, although "semantically" there is no reason for it not to be constexpr, since it is a pure function. Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr? Obviously I would like the function to use the proper asm at runtime; I know I can reimplement any SIMD function with a slow function that is constexpr. If you wonder why I care about constexpr of SIMD functions: Non
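
One possible approach (a sketch only, assuming C++20 and a hypothetical Vec4i wrapper instead of __m128i itself) is to branch on std::is_constant_evaluated(): a plain loop serves constant evaluation, while the runtime path falls through to the real intrinsic:

    #include <emmintrin.h>
    #include <type_traits>

    // Hypothetical wrapper type so the value can exist in a constant expression.
    struct Vec4i { int v[4]; };

    constexpr Vec4i cmpeq_epi32(Vec4i a, Vec4i b) {
        if (std::is_constant_evaluated()) {
            // Compile-time path: same semantics, plain loop.
            Vec4i r{};
            for (int i = 0; i < 4; ++i)
                r.v[i] = (a.v[i] == b.v[i]) ? -1 : 0;
            return r;
        } else {
            // Runtime path: use the actual SSE2 intrinsic.
            Vec4i r;
            __m128i ra = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.v));
            __m128i rb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.v));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(r.v), _mm_cmpeq_epi32(ra, rb));
            return r;
        }
    }

The non-constexpr intrinsic calls are only reached at runtime, so the function still qualifies for constant evaluation on the other branch.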

SSE: unaligned load and store that crosses page boundary

回眸只為那壹抹淺笑 Submitted on 2019-12-03 12:49:28
Question: I read somewhere that before performing an unaligned load or store next to a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if not. I understand that this is needed to prevent a crash if the next page does not belong to the process. But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote
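
If both pages are mapped, a split _mm_loadu_si128 is architecturally safe and merely pays a page-split penalty; the boundary check itself is just address arithmetic. A minimal sketch, assuming 4 KiB pages and a hypothetical helper name:

    #include <emmintrin.h>
    #include <stdint.h>

    /* Does a 16-byte access starting at p stay within one 4 KiB page? */
    static int fits_in_one_page_16(const void *p) {
        return ((uintptr_t)p & 0xFFF) <= (0x1000 - 16);
    }

    /* The unaligned load is valid either way as long as all 16 bytes lie
       inside the process's mapped buffer; the check only matters when the
       bytes past the boundary might be unmapped. */
    static __m128i load16(const void *p) {
        return _mm_loadu_si128((const __m128i *)p);
    }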

What is my compiler doing? (optimizing memcpy)

吃可爱长大的小学妹 Submitted on 2019-12-03 12:38:45
I'm compiling a bit of code using the following settings in VC++ 2010: /O2 /Ob2 /Oi /Ot. However, I'm having some trouble understanding parts of the generated assembly; I have put some questions in the code as comments. Also, what prefetching distance is generally recommended on modern CPUs? I can of course test on my own CPU, but I was hoping for some value that will work well on a wider range of CPUs. Maybe one could use dynamic prefetching distances? EDIT: Another thing I'm surprised about is that the compiler does not interleave the movdqa and movntdq instructions in some form. Since these
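
For context, the movdqa/movntdq pattern referred to here is the classic streaming copy. A minimal sketch with an explicit, tunable prefetch distance follows; the function name, the 64-byte unrolling, and the distance parameter are illustrative assumptions, not the compiler's actual output:

    #include <xmmintrin.h>
    #include <emmintrin.h>
    #include <stddef.h>

    /* Hypothetical streaming copy: 'dist' is the prefetch distance in bytes.
       src and dst are assumed 16-byte aligned and n a multiple of 64. */
    static void stream_copy(char *dst, const char *src, size_t n, size_t dist) {
        for (size_t i = 0; i < n; i += 64) {
            _mm_prefetch(src + i + dist, _MM_HINT_NTA);
            __m128i a = _mm_load_si128((const __m128i *)(src + i));
            __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
            __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
            __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));
            _mm_stream_si128((__m128i *)(dst + i),      a);
            _mm_stream_si128((__m128i *)(dst + i + 16), b);
            _mm_stream_si128((__m128i *)(dst + i + 32), c);
            _mm_stream_si128((__m128i *)(dst + i + 48), d);
        }
        _mm_sfence();  /* make the non-temporal stores globally visible */
    }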

pthreads v. SSE weak memory ordering

♀尐吖头ヾ Submitted on 2019-12-03 12:28:40
Do the Linux glibc pthread functions on x86_64 act as fences for weakly-ordered memory accesses? (pthread_mutex_lock/unlock are the exact functions I'm interested in.) SSE2 provides some instructions with weak memory ordering (non-temporal stores such as movntps in particular). If you are using these instructions and want to guarantee that another thread/core sees an ordering, then I understand you need an explicit fence for this, e.g. an sfence instruction. Normally you would expect the pthread API to act as an appropriate fence. However, I suspect normal C code on x86 will not generate weakly
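
The usual defensive pattern, sketched below with illustrative names, is to issue sfence yourself after the non-temporal stores rather than rely on the mutex implementation to do it:

    #include <emmintrin.h>
    #include <pthread.h>

    /* Hypothetical producer: dst is assumed 16-byte aligned (movntps requires it).
       The explicit sfence orders the weakly-ordered non-temporal store before
       the ordinary stores and the unlock that follow. */
    void publish(float *dst, __m128 v, pthread_mutex_t *m, int *ready) {
        pthread_mutex_lock(m);
        _mm_stream_ps(dst, v);   /* non-temporal (weakly ordered) store */
        _mm_sfence();            /* fence before publishing the flag */
        *ready = 1;
        pthread_mutex_unlock(m);
    }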

128-bit values - From XMM registers to General Purpose

徘徊边缘 Submitted on 2019-12-03 12:21:35
I have a couple of questions related to moving XMM values to general-purpose registers. All the questions found on SO focus on the opposite, namely transferring values in GP registers to XMM. How can I move an XMM register value (128-bit) to two 64-bit general-purpose registers?

    movq RAX, XMM1 ; bits 0 to 63
    mov? RCX, XMM1 ; bits 64 to 127

Similarly, how can I move an XMM register value (128-bit) to four 32-bit general-purpose registers?

    movd EAX, XMM1 ; bits 0 to 31
    mov? ECX, XMM1 ; bits 32 to 63
    mov? EDX, XMM1 ; bits 64 to 95
    mov? ESI, XMM1 ; bits 96 to 127
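
From C or C++ the same extractions can be expressed with intrinsics (a sketch; the upper elements need SSE4.1, since _mm_extract_epi64 and _mm_extract_epi32 compile to pextrq and pextrd):

    #include <smmintrin.h>  /* SSE4.1 */
    #include <stdint.h>

    /* Sketch: pull a 128-bit XMM value into scalar registers via intrinsics.
       In assembly this maps to movq/pextrq and movd/pextrd. */
    void split_xmm(__m128i v) {
        uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);    /* bits 0..63   */
        uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1); /* bits 64..127 */

        uint32_t d0 = (uint32_t)_mm_cvtsi128_si32(v);    /* bits 0..31   */
        uint32_t d1 = (uint32_t)_mm_extract_epi32(v, 1); /* bits 32..63  */
        uint32_t d2 = (uint32_t)_mm_extract_epi32(v, 2); /* bits 64..95  */
        uint32_t d3 = (uint32_t)_mm_extract_epi32(v, 3); /* bits 96..127 */
        (void)lo; (void)hi; (void)d0; (void)d1; (void)d2; (void)d3;
    }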

SIMD optimization of cvtColor using ARM NEON intrinsics

流过昼夜 Submitted on 2019-12-03 12:09:26
I'm working on a SIMD optimization of BGR to grayscale conversion, equivalent to OpenCV's cvtColor() function. There is an Intel SSE version of this function and I'm referring to it. (What I'm doing is basically converting the SSE code to NEON code.) I've almost finished writing the code and can compile it with g++, but I can't get the proper output. Does anyone have any ideas what the error could be? [The question includes images of the incorrect output and the expected output.] Here's my code:

    #include <opencv/cv.hpp>
    #include <opencv/highgui.h>
    #include <arm_neon.h>
    //#include <iostream>
    using namespace std;
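
For reference, a NEON sketch of the per-pixel weighting such a conversion needs, processing 8 pixels of already-deinterleaved B, G, R at a time. This uses a common 8-bit fixed-point approximation of the 0.299/0.587/0.114 weights; it is not the poster's code and not OpenCV's exact fixed-point constants:

    #include <arm_neon.h>

    /* gray ≈ (29*B + 150*G + 77*R) >> 8 for 8 pixels.
       Max accumulator value is 255*(29+150+77) = 65280, which fits in 16 bits. */
    static uint8x8_t bgr_to_gray_8(uint8x8_t b, uint8x8_t g, uint8x8_t r) {
        uint16x8_t acc = vmull_u8(b, vdup_n_u8(29));   /* ~0.114 * 256 */
        acc = vmlal_u8(acc, g, vdup_n_u8(150));        /* ~0.587 * 256 */
        acc = vmlal_u8(acc, r, vdup_n_u8(77));         /* ~0.299 * 256 */
        return vshrn_n_u16(acc, 8);                    /* >> 8, narrow to 8 bits */
    }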

Does gcc use Intel's SSE 4.2 instructions for text processing if available?

痴心易碎 Submitted on 2019-12-03 12:08:34
I read here that Intel introduced SSE 4.2 instructions for accelerating string processing. Quote from the article: "The SSE 4.2 instruction set, first implemented in Intel's Core i7, provides string and text processing instructions (STTNI) that utilize SIMD operations for processing character data. Though originally conceived for accelerating string, text, and XML processing, the powerful new capabilities of these instructions are useful outside of these domains, and it is worth revisiting the search and recognition stages of numerous applications to utilize STTNI to improve performance." Does
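
To make the STTNI instructions the article mentions concrete, here is a small sketch of a strchr-style search over one 16-byte chunk using the pcmpistri intrinsic; the helper name is hypothetical and the caller is assumed to guarantee that 16 bytes at p are readable:

    #include <nmmintrin.h>  /* SSE4.2 */

    /* Returns the index of the first byte equal to c in the 16-byte chunk
       at p (before any terminating NUL), or 16 if there is no match. */
    static int find_char_16(const char *p, char c) {
        __m128i needle = _mm_set1_epi8(c);
        __m128i chunk  = _mm_loadu_si128((const __m128i *)p);
        return _mm_cmpistri(needle, chunk,
                            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
    }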