SSE

SSE vector wrapper type performance compared to bare __m128

Submitted by 走远了吗 on 2019-12-03 05:43:52
Question: I found an interesting Gamasutra article about SIMD pitfalls, which states that it is not possible to reach the performance of the "pure" __m128 type with wrapper types. Well, I was skeptical, so I downloaded the project files and put together a comparable test case. It turned out (to my surprise) that the wrapper version is significantly slower. Since I don't want to argue out of thin air, the test cases are the following: in the first case Vec4 is a simple alias of the __m128 type with…
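A minimal sketch of the two flavours being compared (my own illustration, not the article's or the asker's benchmark; the wrapper name Vec4W is hypothetical). With full optimization both functions typically compile to the same mulps, so measured gaps tend to come from missed inlining or from an ABI that passes the struct by reference instead of in a register.

```cpp
#include <xmmintrin.h>

struct Vec4W {                      // hypothetical thin wrapper around __m128
    __m128 v;
    explicit Vec4W(__m128 x) : v(x) {}
    friend Vec4W operator*(Vec4W a, Vec4W b) {
        return Vec4W(_mm_mul_ps(a.v, b.v));   // forwards to the intrinsic
    }
};

__m128 mul_raw(__m128 a, __m128 b) { return _mm_mul_ps(a, b); }  // bare type
Vec4W  mul_wrap(Vec4W a, Vec4W b)  { return a * b; }             // wrapper
```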

SSE (SIMD): multiply vector by scalar

Submitted by 蹲街弑〆低调 on 2019-12-03 04:55:50
A common operation I do in my program is scaling vectors by a scalar (V*s, e.g. [1,2,3,4]*2 == [2,4,6,8]). Is there an SSE (or AVX) instruction to do this, other than first loading the scalar into every position of a vector (e.g. _mm_set_ps(2,2,2,2)) and then multiplying? This is what I do now: __m128 _scalar = _mm_set_ps(s,s,s,s); __m128 _result = _mm_mul_ps(_vector, _scalar); I'm looking for something like __m128 _result = _mm_scale_ps(_vector, s); Depending on your compiler, you may be able to improve the code generation a little by using _mm_set1_ps: const __m128 scalar = _mm_set1_ps(s); …
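There is no _mm_scale_ps; broadcast-then-multiply, as the answer suggests, is the idiom. A minimal self-contained sketch of that pattern:

```cpp
#include <xmmintrin.h>

// Scale all four lanes of v by the scalar s.
static inline __m128 scale_ps(__m128 v, float s) {
    return _mm_mul_ps(v, _mm_set1_ps(s));   // broadcast s, then vector multiply
}
```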

Optimizing Array Compaction

Submitted by China☆狼群 on 2019-12-03 04:38:41
Let's say I have an array k = [1 2 0 0 5 4 0]. I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0]. Using only the mask m and the following operations (shift left/right, and/or, add/subtract/multiply), I can compact k into [1 2 5 4]. Here's how I currently do it (MATLAB pseudocode): function out = compact(in); d = in; for i = 1:size(in, 2) % do (# of items in in) passes; m = d > 0; ml = [m(2:end) 0] % shift left, pad with 0 on the right; dl = [d(2:end) 0] % shift; % if the data originally has a gap, fill it in with the left-shifted value; use = (m == 0) & (ml == 1) % 2 comparisons; d = use…
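A scalar C++ sketch of the same idea (my own illustration, not a translation of the asker's MATLAB): each pass pulls a nonzero element one slot to the left wherever a gap sits next to it, and n passes compact the whole array in place, leaving trailing zeros.

```cpp
#include <cstddef>

// One "bubble" pass: fill each zero slot from its right neighbour.
static void compact_pass(int *d, std::size_t n) {
    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (d[i] == 0 && d[i + 1] != 0) {   // gap with data to its right
            d[i] = d[i + 1];                // fill the gap
            d[i + 1] = 0;                   // leave a zero behind
        }
    }
}

// n passes are always enough, matching the question's outer loop.
static void compact(int *d, std::size_t n) {
    for (std::size_t pass = 0; pass < n; ++pass)
        compact_pass(d, n);
}
```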

SSE: unaligned load and store that crosses page boundary

Submitted by China☆狼群 on 2019-12-03 03:13:52
I read somewhere that before performing an unaligned load or store near a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if not. I understand that this is needed to prevent a crash if the next page does not belong to the process. But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote a small test program which performed an unaligned load and store that crossed a page boundary, and it did not…
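A minimal sketch of the kind of test the question describes (my own, Linux-only, and it assumes 4 KiB pages; a robust version would query sysconf(_SC_PAGESIZE)). As long as both pages are mapped, the hardware handles the split access, possibly with a performance penalty, and no fault occurs.

```cpp
#include <emmintrin.h>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
    const std::size_t page = 4096;   // assumed page size
    void *mem = mmap(nullptr, 2 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    auto *buf = static_cast<unsigned char *>(mem);
    std::memset(buf, 0xAB, 2 * page);

    unsigned char *p = buf + page - 8;               // 8 bytes before the boundary
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p)); // spans both pages
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), v);               // so does the store
    std::puts("crossed the page boundary without faulting");
    return 0;
}
```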

Common SIMD techniques

Submitted by 萝らか妹 on 2019-12-03 02:48:26
Question: Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful than that: it can handle complex conditional code without branches. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of the corresponding bytes of Ra and Rb: USUB8 Rd, Ra, Rb; SEL Rd, Rb, Ra. Links to tutorials / uncommon SIMD techniques are welcome too :) ARMv6 is the most interesting…
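For comparison with the ARMv6 USUB8/SEL pair, and to tie this back to the SSE theme of this page: on x86 the same byte-wise unsigned minimum is a single SSE2 intrinsic, as in this small sketch.

```cpp
#include <emmintrin.h>

// Per-byte unsigned minimum of two 16-byte vectors.
static inline __m128i bytewise_umin(__m128i a, __m128i b) {
    return _mm_min_epu8(a, b);   // one instruction: pminub
}
```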

SSE multiplication 16 x uint8_t

Submitted by 余生颓废 on 2019-12-03 02:42:31
I want to multiply, with SSE4, a __m128i object containing 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8? Marat Dukhan: There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows: inline __m128i _mm_mullo_epi8(__m128i a, __m128i b) { __m128i zero = _mm_setzero_si128(); __m128i Alo = _mm_cvtepu8_epi16(a); __m128i Ahi = _mm_unpackhi_epi8(a, zero); __m128i Blo = _mm_cvtepu8_epi16(b); __m128i Bhi = _mm_unpackhi_epi8(b, zero); __m128i…
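The excerpt is cut off above; here is a sketch of one way to finish the emulation (not necessarily the answerer's exact code): multiply the widened 16-bit halves, keep only the low byte of each product, and pack the halves back together.

```cpp
#include <smmintrin.h>   // SSE4.1 for _mm_cvtepu8_epi16

static inline __m128i mullo_epu8(__m128i a, __m128i b) {
    __m128i zero = _mm_setzero_si128();
    __m128i alo  = _mm_cvtepu8_epi16(a);          // low 8 bytes, zero-extended to 16-bit
    __m128i ahi  = _mm_unpackhi_epi8(a, zero);    // high 8 bytes, zero-extended
    __m128i blo  = _mm_cvtepu8_epi16(b);
    __m128i bhi  = _mm_unpackhi_epi8(b, zero);
    __m128i plo  = _mm_mullo_epi16(alo, blo);     // 16-bit products of the low half
    __m128i phi  = _mm_mullo_epi16(ahi, bhi);     // 16-bit products of the high half
    __m128i mask = _mm_set1_epi16(0x00FF);        // keep only the low byte (mod 256)
    return _mm_packus_epi16(_mm_and_si128(plo, mask),
                            _mm_and_si128(phi, mask));
}
```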

Does rewriting memcpy/memcmp/… with SIMD instructions make sense?

Submitted by 无人久伴 on 2019-12-03 02:32:33
Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't gcc generate SIMD instructions for these library functions by default? Also, are there any other functions that could be improved by SIMD? Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library / compiler intrinsics included optimized versions, but that doesn't seem to be pervasive. I have a custom SIMD memchr which is a hell of a lot faster than the library version, especially when I'm finding the first of 2 or 3 characters…
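Not the answerer's code, just a minimal sketch of the technique behind a SIMD memchr: compare 16 bytes at a time against a broadcast needle and use the movemask to locate the first hit. A production version would also handle the unaligned head and avoid reading past the buffer.

```cpp
#include <emmintrin.h>
#include <cstring>

static const void *memchr_sse2(const void *s, int c, std::size_t n) {
    const unsigned char *p = static_cast<const unsigned char *>(s);
    const __m128i needle = _mm_set1_epi8(static_cast<char>(c));
    while (n >= 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                            // at least one byte matched
            return p + __builtin_ctz(mask);  // index of the first match
        p += 16;
        n -= 16;
    }
    return std::memchr(p, c, n);             // scalar tail
}
```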

What's the difference among the GCC CFLAGS SSE options -msse, -msse2, -mssse3, -msse4, etc., and how do I determine which to use?

Submitted by 心不动则不痛 on 2019-12-03 01:40:08
For the GCC CFLAGS options -msse, -msse2, -mssse3, -msse4, -msse4.1, -msse4.2: are they exclusive in their use, or can they be used together? My understanding is that choosing which to set depends on whether the target architecture the program will run on supports it; is this correct? If so, how can I find out which SSE level my target architecture supports? On Linux I can cat /proc/cpuinfo, but what about Mac or Windows? Thanks! The -m switches can be used in parallel; furthermore, some of them are implied by the architecture or by other switches. For instance, if you build code for x86_64, -msse and -msse2…
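One portable-enough way to check support at runtime (works with GCC and recent Clang on Linux, macOS and Windows builds) is the compiler builtin shown in this sketch; at compile time, the predefined macros __SSE2__, __SSE4_1__, __SSE4_2__ reflect the -m flags that are in effect.

```cpp
#include <cstdio>

int main() {
    __builtin_cpu_init();   // required before __builtin_cpu_supports on GCC
    std::printf("sse2:   %d\n", __builtin_cpu_supports("sse2"));
    std::printf("sse4.1: %d\n", __builtin_cpu_supports("sse4.1"));
    std::printf("sse4.2: %d\n", __builtin_cpu_supports("sse4.2"));
    return 0;
}
```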

Benefits of x87 over SSE

Submitted by 你。 on 2019-12-03 01:19:11
I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers. Nils Pipenbrinck: For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set. Off the top of my head, it's all trigonometric stuff like fsin, fcos, fptan, fpatan, plus some exponential/logarithm stuff. With gcc -O3 -ffast-math -mfpmath…
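A small GCC/Clang inline-asm sketch of the hand-written-asm point (my own, not from the answer): fsin has no SSE counterpart, so using it means going through the x87 register stack. This is only an illustration; fsin is slow, loses accuracy for large arguments, and only works for |x| < 2^63, so the library sin() is normally preferable.

```cpp
// Compute sin(x) with the x87 fsin instruction.
// The "t" constraint is the top of the x87 stack, st(0).
static double x87_sin(double x) {
    double r;
    __asm__("fsin" : "=t"(r) : "0"(x));   // st(0) = sin(st(0))
    return r;
}
```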

How to compare two vectors using SIMD and get a single boolean result?

Submitted by £可爱£侵袭症+ on 2019-12-03 00:52:51
I have two vectors of 4 integers each and I'd like to use a SIMD instruction to compare them (say, generate a result vector where each entry is 0 or 1 according to the result of the comparison). Then I'd like to compare the result vector to a vector of 4 zeros, and only if they're equal do something. Do you know which SIMD instructions I can use to do this? To compare two SIMD vectors: #include <stdint.h> #include <xmmintrin.h> int32_t __attribute__ ((aligned(16))) vector1[4] = { 1, 2, 3, 4 }; int32_t __attribute__ ((aligned(16))) vector2[4] = { 1, 2, 2, 2 }; int32_t __attribute__ ((aligned(16))) result[4…
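One common way to collapse a vector compare into a single boolean (this may or may not be where the truncated answer was heading): compare element-wise, then turn the result into an integer bitmask with movemask and test it.

```cpp
#include <emmintrin.h>

// True iff all four 32-bit lanes of a and b are equal.
static inline bool all_equal_epi32(__m128i a, __m128i b) {
    __m128i eq = _mm_cmpeq_epi32(a, b);          // 0xFFFFFFFF in lanes that match
    return _mm_movemask_epi8(eq) == 0xFFFF;      // all 16 byte lanes set?
}
```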