sse

Calculating matrix product is much slower with SSE than with straight-forward-algorithm

半世苍凉 submitted on 2019-11-30 16:04:49

Question: I want to multiply two matrices, once using the straightforward algorithm:

```cpp
template <typename T>
void multiplicate_straight(T** A, T** B, T** C, int sizeX) {
    T** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
    transpose_matrix(B, D, sizeX);
    for (int i = 0; i < sizeX; i++) {
        for (int j = 0; j < sizeX; j++) {
            for (int g = 0; g < sizeX; g++) {
                C[i][j] += A[i][g] * D[j][g];
            }
        }
    }
    FreeDynamicArray2D<T>(D);
}
```

and once using SSE functions. For this I created two functions:

```cpp
template <typename T>
void SSE_vectormult(T* A, T* B, int size) {
    __m128d a;
    __m128d b;
    __m128d c;
#ifdef linux
```

…
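The transposed-B layout in the excerpt turns the inner loop into a dot product of two contiguous rows, which is exactly the shape SSE handles well. A minimal sketch for the float case (the helper name `dot_sse` and the multiple-of-4 size are my assumptions, not from the question):

```cpp
#include <xmmintrin.h>

// Hypothetical helper: dot product of row i of A and row j of B^T,
// four floats per iteration. Assumes sizeX is a multiple of 4.
float dot_sse(const float* a, const float* d, int sizeX) {
    __m128 acc = _mm_setzero_ps();
    for (int g = 0; g < sizeX; g += 4) {
        __m128 va = _mm_loadu_ps(a + g);
        __m128 vd = _mm_loadu_ps(d + g);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vd));  // 4 partial sums
    }
    // horizontal sum of the 4 lanes
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    acc = _mm_add_ps(acc, shuf);
    shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 0, 3, 2));
    acc = _mm_add_ps(acc, shuf);
    return _mm_cvtss_f32(acc);
}
```

`C[i][j]` would then be `dot_sse(A[i], D[j], sizeX)` inside the same two outer loops as the scalar version.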

SIMD code runs slower than scalar code

喜夏-厌秋 submitted on 2019-11-30 15:17:42

elma and elmc are both unsigned long arrays, as are res1 and res2.

```c
unsigned long simdstore[2];
__m128i *p, simda, simdb, simdc;
p = (__m128i *) simdstore;

for (i = 0; i < _polylen; i++) {
    u1 = (elma[i] >> l) & 15;
    u2 = (elmc[i] >> l) & 15;
    for (k = 0; k < 20; k++) {
        //res1[i + k] ^= _mulpre1[u1][k];
        //res2[i + k] ^= _mulpre2[u2][k];
        simda = _mm_set_epi64x(_mulpre2[u2][k], _mulpre1[u1][k]);
        simdb = _mm_set_epi64x(res2[i + k], res1[i + k]);
        simdc = _mm_xor_si128(simda, simdb);
        _mm_store_si128(p, simdc);
        res1[i + k] = simdstore[0];
        res2[i + k] = simdstore[1];
    }
}
```

Within the for loop is …
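The usual diagnosis for this pattern is that `_mm_set_epi64x` assembles each vector from scalar registers and the result then bounces back through `simdstore`, so the loop spends its time repacking data rather than XORing. If the data were laid out so whole vectors could be loaded directly, the same work becomes one load/xor/store per pair. A sketch under that (hypothetical) contiguous-layout assumption:

```cpp
#include <emmintrin.h>
#include <cstdint>

// Sketch: XOR src into dst two 64-bit words at a time, with direct
// vector loads/stores instead of scalar packing and unpacking.
// Assumes the words to combine are contiguous, unlike the original code.
void xor_block(uint64_t* dst, const uint64_t* src, int n) {
    for (int i = 0; i + 2 <= n; i += 2) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_xor_si128(a, b));
    }
}
```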

AVX/SSE version of xorshift128+

蓝咒 submitted on 2019-11-30 14:58:49

Question: I am trying to make the fastest possible high-quality RNG. Having read http://xorshift.di.unimi.it/, xorshift128+ seems like a good option. The C code is

```c
#include <stdint.h>

uint64_t s[2];

uint64_t next(void) {
    uint64_t s1 = s[0];
    const uint64_t s0 = s[1];
    s[0] = s0;
    s1 ^= s1 << 23; // a
    return (s[1] = (s1 ^ s0 ^ (s1 >> 17) ^ (s0 >> 26))) + s0; // b, c
}
```

Sadly I am not an SSE/AVX expert, but my CPU supports the SSE4.1 / SSE4.2 / AVX / F16C / FMA3 / XOP instructions. How could …
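One common way to vectorize this generator (a sketch, not from the question) is to run two independent xorshift128+ streams, one in each 64-bit lane of an `__m128i`, using `_mm_slli_epi64`/`_mm_srli_epi64` for the shifts:

```cpp
#include <emmintrin.h>
#include <cstdint>

// Two xorshift128+ streams in parallel: lane k of s0/s1 holds the
// state of stream k. Seeds must be nonzero.
struct XorShift128PlusX2 {
    __m128i s0, s1;
    __m128i next() {
        __m128i x = s0;
        const __m128i y = s1;
        s0 = y;
        x = _mm_xor_si128(x, _mm_slli_epi64(x, 23));                   // a
        x = _mm_xor_si128(x, _mm_srli_epi64(x, 17));                   // b
        x = _mm_xor_si128(x, _mm_xor_si128(y, _mm_srli_epi64(y, 26))); // c
        s1 = x;
        return _mm_add_epi64(x, y);
    }
};
```

Seeding both lanes identically yields two identical streams, which is a convenient sanity check; real use would seed the lanes differently (e.g. from SplitMix64).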

Flipping sign on packed SSE floats

跟風遠走 submitted on 2019-11-30 14:44:56

Question: I'm looking for the most efficient method of flipping the sign of all four floats packed in an SSE register. I have not found an intrinsic for this in the Intel Architecture software developer manuals. Below are the things I've already tried; for each case I looped over the code 10 billion times and recorded the wall time shown. I'm trying to at least match the 4 seconds my non-SIMD approach takes, which uses just the unary minus operator.

[48 sec] _mm_sub_ps( _mm_setzero_ps(), vec );
[32 …
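The answer usually given for this problem is a single bitwise XOR with the sign-bit mask, one instruction per vector (a sketch, not tied to the asker's timings):

```cpp
#include <xmmintrin.h>

// Flip the sign of all four packed floats by XORing the sign bits.
// _mm_set1_ps(-0.0f) puts 0x80000000 in every lane.
__m128 negate_ps(__m128 v) {
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}
```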

How to perform uint32/float conversion with SSE?

穿精又带淫゛_ submitted on 2019-11-30 13:59:45

In SSE there is the function _mm_cvtepi32_ps(__m128i input), which takes an input vector of 32-bit-wide signed integers (int32_t) and converts them to floats. Now I want to interpret the input integers as unsigned, but there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function, or can you at least give a hint on the implementation? To illustrate the difference in results:

```c
unsigned int a = 2480160505;  // 10010011 11010100 00111110 11111001
float a1 = a;                 // 01001111 00010011 11010100 00111111
float a2 = (signed int)a;     // 11001110 …
```
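One well-known implementation (a sketch; the function name is mine) splits each 32-bit value into its high and low 16-bit halves. Both halves are exactly representable in float after the signed conversion, so the only rounding happens in the final addition, and the result matches a direct unsigned-to-float cast:

```cpp
#include <emmintrin.h>

// Unsigned 32-bit -> float: split into exact 16-bit halves so the only
// rounding is in the final addition.
__m128 cvtepu32_ps(__m128i v) {
    __m128i hi = _mm_srli_epi32(v, 16);                     // top 16 bits
    __m128i lo = _mm_and_si128(v, _mm_set1_epi32(0xFFFF));  // bottom 16 bits
    __m128 fhi = _mm_mul_ps(_mm_cvtepi32_ps(hi), _mm_set1_ps(65536.0f));
    return _mm_add_ps(fhi, _mm_cvtepi32_ps(lo));
}
```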

Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

本秂侑毒 submitted on 2019-11-30 13:58:06

I'm developing optimizations for my 3D calculations and I now have: a "plain" version using the standard C language libraries; an SSE-optimized version that compiles under a preprocessor #define USE_SSE; and an AVX-optimized version that compiles under a preprocessor #define USE_AVX. Is it possible to switch between the three versions without having to compile different executables (e.g. having different library files and loading the "right" one dynamically; I don't know whether inline functions are "right" for that)? I'd also consider the performance cost of having this kind of switch in the software. There are …
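One standard pattern (a sketch, not the asker's code) is to compile each variant in its own translation unit with the matching -msse2/-mavx flags, then select one through a function pointer at startup; GCC and Clang expose the CPU check as __builtin_cpu_supports:

```cpp
// Runtime dispatch sketch. scale_sse / scale_avx would be the real
// optimized variants, each in its own translation unit; here only the
// plain version exists, so the selector returns it in every branch.
static void scale_plain(float* v, int n, float s) {
    for (int i = 0; i < n; ++i) v[i] *= s;
}

using ScaleFn = void (*)(float*, int, float);

static ScaleFn pick_scale() {
#if defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))  return scale_plain;  // scale_avx in real code
    if (__builtin_cpu_supports("sse2")) return scale_plain;  // scale_sse in real code
#endif
    return scale_plain;
}
```

The function-pointer indirection costs one indirect call per invocation, which is why dispatch is usually done at a coarse granularity (whole routines, not inner loops).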

Why don't GCC and Clang use cvtss2sd [memory]?

一世执手 submitted on 2019-11-30 13:09:07

I'm trying to optimize some code that's supposed to read single-precision floats from memory and perform arithmetic on them in double precision. This has become a significant performance bottleneck: the code that stores data in memory as single precision is substantially slower than equivalent code that stores it as double precision. Below is a toy C++ program that captures the essence of my issue:

```cpp
#include <cstdio>

// noinline to force main() to actually read the value from memory.
__attribute__ ((noinline)) float* GetFloat() {
    float* f = new float;
    *f = 3.14;
    return f;
}

int …
```

How to divide 16-bit integer by 255 with using SSE?

隐身守侯 submitted on 2019-11-30 12:54:52

I work in image processing. I need to divide a 16-bit-integer SSE vector by 255. I can't use a shift operator such as _mm_srli_epi16(), because 255 is not a power of two. I know, of course, that it is possible to convert the integers to float, perform the division, and then convert back to integer, but perhaps somebody knows another solution... There is an integer approximation of division by 255:

```cpp
inline int DivideBy255(int value) {
    return (value + 1 + (value >> 8)) >> 8;
}
```

So using SSE2 it will look like:

```cpp
inline __m128i DivideI16By255(__m128i value) {
    return _mm_srli_epi16(_mm_add_epi16( _mm
```
…

C++ use SSE instructions for comparing huge vectors of ints

Deadly submitted on 2019-11-30 12:43:02

I have a huge vector<vector<int>> (18M x 128). Frequently I want to take two rows of this vector and compare them with this function:

```cpp
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;
    for (int k = 0; k < 128; k += 2) {
        pplus = nodeL[indx2][k] - nodeL[indx1][k];
        pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
```

As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns the maximum. This function will be used a million times, so I was wondering if …
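Since even indices contribute nodeL[indx2][k] - nodeL[indx1][k] and odd indices the reverse difference, one SSE2-only sketch (assuming flat contiguous rows rather than vector<vector<int>>) computes b - a everywhere, negates the odd lanes with a mask, and keeps a lanewise running maximum built from compare-and-blend:

```cpp
#include <emmintrin.h>
#include <algorithm>

// SSE2-only sketch of getDiff over two contiguous 128-int rows.
// Even lanes need b-a, odd lanes a-b, so negate the odd lanes of b-a.
int getDiffSSE(const int* a, const int* b) {
    const __m128i odd = _mm_setr_epi32(0, -1, 0, -1);  // mask of odd lanes
    __m128i best = _mm_setzero_si128();                // result starts at 0
    for (int k = 0; k < 128; k += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + k));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + k));
        __m128i d  = _mm_sub_epi32(vb, va);
        __m128i nd = _mm_sub_epi32(_mm_setzero_si128(), d);
        d = _mm_or_si128(_mm_and_si128(odd, nd), _mm_andnot_si128(odd, d));
        __m128i gt = _mm_cmpgt_epi32(d, best);         // lanewise max
        best = _mm_or_si128(_mm_and_si128(gt, d), _mm_andnot_si128(gt, best));
    }
    int tmp[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp), best);
    return std::max(std::max(tmp[0], tmp[1]), std::max(tmp[2], tmp[3]));
}
```

With SSE4.1 available, the compare-and-blend pair collapses to a single _mm_max_epi32.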