sse

_mm_crc32_u64 poorly defined

核能气质少年 submitted on 2019-11-30 20:40:50
Why in the world was _mm_crc32_u64(...) defined like this? unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v ); The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (it is, after all, CRC32, not CRC64). If the CRC32 machine instruction has a 64-bit destination operand, the upper 32 bits are ignored and filled with zeros on completion, so there is never any use for a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source
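To make the 32-bit versus 64-bit point concrete, here is a minimal sketch of how the intrinsic is commonly wrapped so that callers only ever see a 32-bit CRC. The helper name crc32c_update is hypothetical, and the GCC/Clang target attribute is an assumption that lets the file build without -msse4.2; the CPU still has to support SSE4.2 at run time.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64, _mm_crc32_u8 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: feed `len` bytes into a running CRC-32C value.
 * The accumulator is widened to 64 bits only because the intrinsic's
 * signature asks for it; its upper 32 bits are always zero. */
static __attribute__((target("sse4.2")))
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t acc = crc;

    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof chunk);   /* safe unaligned 8-byte load */
        acc = _mm_crc32_u64(acc, chunk);
        p += 8;
        len -= 8;
    }

    uint32_t crc32 = (uint32_t)acc;        /* nothing is lost by narrowing */
    while (len--)
        crc32 = _mm_crc32_u8(crc32, *p++);
    return crc32;
}
```

With the conventional CRC-32C framing (start from 0xFFFFFFFF, invert the result), a call looks like crc32c_update(0xFFFFFFFFu, buf, n) ^ 0xFFFFFFFFu.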

How to negate (change sign) of the floating point elements in a __m128 type variable?

冷暖自知 submitted on 2019-11-30 20:36:46
Is there any single instruction or function that can invert the sign of every float inside a __m128? That is, a = r0:r1:r2:r3 ===> a = -r0:-r1:-r2:-r3? I know this can be done with _mm_sub_ps(_mm_set1_ps(0.0), a), but isn't that potentially slow, since _mm_set1_ps(0.0) is a multi-instruction function? In practice your compiler should do a good job of generating the constant vector for 0.0. It will probably just use _mm_xor_ps to zero a register, and if your code is in a loop it should hoist the constant generation out of the loop anyway. So, bottom line, use your original idea of: v = _mm_sub_ps(_mm_set1_ps(0.0), v);
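A widely used alternative to the subtraction, not taken from the excerpt above, is to flip the sign bit directly by XORing with -0.0f. The sketch below assumes plain SSE; the name negate_ps is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set1_ps, _mm_xor_ps */

/* Hypothetical helper: negate all four floats by toggling bit 31 of each
 * lane. -0.0f has only the sign bit set, so it works as the XOR mask. */
static inline __m128 negate_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(v, sign_mask);
}
```

One behavioural difference is worth knowing: under the default rounding mode, 0.0 - (+0.0) gives +0.0, so the subtraction does not flip the sign of a positive zero, whereas the XOR version flips the sign bit of every input, including zeros and NaNs.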

How to align stack at 32 byte boundary in GCC?

不打扰是莪最后的温柔 submitted on 2019-11-30 19:37:22
I'm using a MinGW64 build based on GCC 4.6.1 for a 64-bit Windows target. I'm playing around with Intel's new AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx. But I started running into segmentation faults when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack on 64-bit Windows has only 16-byte alignment. How can I change GCC's stack alignment to 32 bytes? I have tried using -mstackrealign but to no

Can counting byte matches between two strings be optimized using SIMD?

三世轮回 submitted on 2019-11-30 19:26:47
Profiling suggests that this function is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j = 0; j < size; ++j) { if (string1[j] == string2[j]) { ++r; } } return r; } Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be faster. Any ideas on how to speed things up? Target architecture is x86-64. Sam Compiler flags for
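As an illustration of what a hand-vectorized version might look like, here is a minimal SSE2 sketch that compares 16 bytes at a time. It is not the truncated answer's code; the function name count_equal_chars_sse2 is hypothetical, and __builtin_popcount is a GCC/Clang builtin assumed to be available.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8 */

/* Hypothetical SSE2 variant of countEqualChars. */
static int count_equal_chars_sse2(const char *s1, const char *s2, int size)
{
    int count = 0;
    int j = 0;

    /* Main loop: 16 bytes per iteration, unaligned loads. */
    for (; j + 16 <= size; j += 16) {
        __m128i a  = _mm_loadu_si128((const __m128i *)(s1 + j));
        __m128i b  = _mm_loadu_si128((const __m128i *)(s2 + j));
        __m128i eq = _mm_cmpeq_epi8(a, b);       /* 0xFF where bytes match */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq); /* one bit per byte */
        count += __builtin_popcount(mask);
    }

    /* Scalar tail for the last size % 16 bytes. */
    for (; j < size; ++j)
        count += (s1[j] == s2[j]);

    return count;
}
```

The movemask-plus-popcount step keeps the sketch short. Accumulating the comparison results in a vector register and reducing them once at the end is a common refinement when the inputs are long.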

Integer dot product using SSE/AVX?

我只是一个虾纸丫 submitted on 2019-11-30 18:44:30
Question: I am looking at the Intel Intrinsics Guide: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and whilst it has _mm_dp_ps and _mm_dp_pd for calculating the dot product of floats and doubles, I cannot see anything for calculating an integer dot product. I have two unsigned int[8] arrays and I would like to compute: (a[0] * b[0]) + (a[1] * b[1]) ....... + (a[num_elements_in_array-1] * b[num_elements_in_array-1]) (in batches of four) and sum the products. Answer 1: Every time someone does this:
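For illustration only (the answer above is cut off, so this is not its code), one way to build an unsigned 32-bit dot product from SSE2 primitives is to widen each product to 64 bits with _mm_mul_epu32. The function name dot_u32_sse2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_mul_epu32, _mm_srli_epi64, _mm_add_epi64 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of a[i] * b[i], accumulated in 64 bits. */
static uint64_t dot_u32_sse2(const uint32_t *a, const uint32_t *b, size_t n)
{
    __m128i acc = _mm_setzero_si128();   /* two 64-bit partial sums */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));

        /* _mm_mul_epu32 multiplies elements 0 and 2, giving 64-bit products. */
        __m128i even = _mm_mul_epu32(va, vb);
        /* Shift elements 1 and 3 down into the low halves, then multiply. */
        __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(va, 32),
                                     _mm_srli_epi64(vb, 32));

        acc = _mm_add_epi64(acc, _mm_add_epi64(even, odd));
    }

    /* Horizontal add of the two 64-bit lanes. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint64_t sum = lanes[0] + lanes[1];

    /* Scalar tail when n is not a multiple of four. */
    for (; i < n; ++i)
        sum += (uint64_t)a[i] * b[i];

    return sum;
}
```

Accumulating in 64 bits avoids losing the high half of each product. If 32-bit wraparound is acceptable, SSE4.1's _mm_mullo_epi32 handles four products per multiply instead of two.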

fastest way to fill a vector (SSE2) with a certain value. Templates friendly

早过忘川 submitted on 2019-11-30 18:22:22
Question: I have this template class: template<size_t D> struct A{ double v_sse __attribute__ ((vector_size (8*D))); A(double val){ //what here? } }; What's the best way to fill the v_sse field with copies of val? Since I use vectors, I can use GCC SSE2 intrinsics. Answer 1: It would be nice if we could write code once, and compile it for wider vectors with just a small tweak, even in cases where auto-vectorization doesn't do the trick. I got the same result as @hirschhornsalz: massive, inefficient code

8 bit shift operation in AVX2 with shifting in zeros

爱⌒轻易说出口 submitted on 2019-11-30 18:20:59
Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems just to execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The left shift should work on a __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0] I saw in another thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit elements. But I need a more generic solution to shift by x bytes. Does anybody already have a solution for this problem? Okay, I implemented a function that can shift left by up to 16 bytes.

Why do some SSE “mov” instructions specify that they move floating-point values?

↘锁芯ラ submitted on 2019-11-30 17:17:36
Many SSE "mov" instructions specify that they are moving floating-point values. For example: MOVHLPS: Move Packed Single-Precision Floating-Point Values High to Low; MOVSD: Move Scalar Double-Precision Floating-Point Value; MOVUPD: Move Unaligned Packed Double-Precision Floating-Point Values. Why don't these instructions simply say that they move 32-bit or 64-bit values? If they're just moving bits around, why do the instructions specify that they are for floating-point values? Surely they would work whether you interpret those bits as floating-point or not? Josh Haberman I think I've found the

Why do some SSE “mov” instructions specify that they move floating-point values?

爱⌒轻易说出口 submitted on 2019-11-30 16:33:50
Question: Many SSE "mov" instructions specify that they are moving floating-point values. For example: MOVHLPS: Move Packed Single-Precision Floating-Point Values High to Low; MOVSD: Move Scalar Double-Precision Floating-Point Value; MOVUPD: Move Unaligned Packed Double-Precision Floating-Point Values. Why don't these instructions simply say that they move 32-bit or 64-bit values? If they're just moving bits around, why do the instructions specify that they are for floating-point values? Surely they would

Multiplying vector by constant using SSE

送分小仙女□ submitted on 2019-11-30 16:23:14
I have some code that operates on 4D vectors and I'm currently trying to convert it to use SSE. I'm using both clang and gcc on 64-bit Linux. Operating only on vectors is fine; I have grasped that. But now comes a part where I have to multiply an entire vector by a single constant. I want to go from something like this: float y[4]; float a1 = 25.0/216.0; for(j=0; j<4; j++){ y[j] = a1 * x[j]; } to something like this: float4 y; float a1 = 25.0/216.0; y = a1 * x; where: typedef double v4sf __attribute__ ((vector_size(4*sizeof(float)))); typedef union float4{ v4sf v; float x,y,z,w; } float4; This of course will not
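With the SSE intrinsics rather than the GCC vector-extension types used in the question, scaling four floats by one scalar is usually written as a broadcast followed by a packed multiply. The following is a minimal sketch under that assumption; the helper name scale_ps is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set1_ps, _mm_mul_ps */

/* Hypothetical helper: multiply every lane of x by the scalar a.
 * _mm_set1_ps copies `a` into all four lanes before the multiply. */
static inline __m128 scale_ps(__m128 x, float a)
{
    return _mm_mul_ps(_mm_set1_ps(a), x);
}
```

In a loop, computing the broadcast once outside the loop body and reusing it makes the intent explicit, although optimizing compilers generally hoist it on their own.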