sse | 易学教程

C - How to access elements of vector using GCC SSE vector extension

Question: Usually I work with 3D vectors using the following type: typedef float vec3_t[3]; initializing vectors with something like: vec3_t x_basis = {1.0, 0.0, 0.0}; vec3_t y_basis = {0.0, 1.0, 0.0}; vec3_t z_basis = {0.0, 0.0, 1.0}; and accessing them with something like: x_basis[X] * y_basis[X] + ... Now I need vector arithmetic using SSE instructions. I have the following code: typedef float v4sf __attribute__ ((mode(V4SF))); int main(void) { v4sf a,b,c; a = (v4sf){0.1f,0.2f,0.3f,0.4f}; b = (v4sf){0.1f,0.2f
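
The excerpt stops before any answer, so the following is only a sketch of two common ways to read individual lanes of a GCC vector type. It assumes a GCC or Clang release that allows subscripting vector types in C, and it declares the type with vector_size(16) rather than mode(V4SF). The helpers dot3 and lane_via_union are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* GCC/Clang vector extension: four packed floats in one 16-byte value. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: dot product of the first three lanes. */
static float dot3(v4sf a, v4sf b)
{
    v4sf p = a * b;            /* element-wise multiply */
    return p[0] + p[1] + p[2]; /* direct subscripting of the vector */
}

/* Fallback for compilers without vector subscripting: view the
   vector through a union with a plain float array. */
typedef union {
    v4sf  v;
    float f[4];
} v4sf_u;

static float lane_via_union(v4sf v, int i)
{
    v4sf_u u;
    assert(i >= 0 && i < 4);
    u.v = v;
    return u.f[i];
}
```

The fourth lane is simply ignored by dot3, which is one way to keep the 3D layout from the question while still using a 4-wide type.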

Fast counting the number of equal bytes between two arrays [duplicate]

Question: This question already has answers here: Can counting byte matches between two strings be optimized using SIMD? (3 answers) Closed 8 months ago. I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) to compare two 16-byte values using SSE instructions: it returns how many bytes are equal after performing the comparison. Now I would like to use this function to compare two byte arrays of arbitrary length: the length may not be a multiple of 16
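
A minimal sketch of one way to handle arbitrary lengths: process full 16-byte blocks with SSE2 and finish the remainder with a scalar loop. The body of compare_16bytes is not shown in the excerpt, so this version inlines its own comparison. The function name count_equal_bytes is hypothetical, and __builtin_popcount is a GCC/Clang builtin, not part of SSE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Count positions i < n where a[i] == b[i]. */
static size_t count_equal_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    assert(a != NULL && b != NULL);

    /* Full 16-byte blocks: unaligned loads, byte-wise compare,
       then one mask bit per byte that compared equal. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        count += (size_t)__builtin_popcount((unsigned)mask);
    }

    /* Tail: fewer than 16 bytes remain, so compare them one by one. */
    for (; i < n; ++i)
        count += (a[i] == b[i]);

    return count;
}
```

Handling the tail in scalar code avoids reading past the end of either array, at the cost of up to 15 extra iterations.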

Speed up matrix multiplication by SSE (C++)

Question: I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions. 1) I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement? 2) Do I need the Intel compiler to do it? 3) Can you point out
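
As an illustration only (no answer is visible in the excerpt), here is a sketch of one layout that suits a fixed 5x5 matrix: store it column by column, pad each column to 8 floats, and accumulate x[j] times column j. The layout, the padding and the name matvec5 are assumptions made for this example. It uses plain SSE intrinsics, which GCC and Clang provide as well as the Intel compiler. Whether it beats scalar code by a tangible margin would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* y = M * x for a 5x5 matrix M.
   cols holds 5 columns of M, each zero-padded to 8 floats:
   cols[j * 8 + i] is M[i][j] for i < 5 and 0 for i >= 5. */
static void matvec5(const float *cols, const float *x, float *y)
{
    __m128 lo = _mm_setzero_ps(); /* accumulates y[0..3] */
    __m128 hi = _mm_setzero_ps(); /* accumulates y[4] plus padding */
    float tmp[8];
    int i, j;

    assert(cols != NULL && x != NULL && y != NULL);

    for (j = 0; j < 5; ++j) {
        __m128 xj = _mm_set1_ps(x[j]); /* broadcast x[j] to all lanes */
        lo = _mm_add_ps(lo, _mm_mul_ps(_mm_loadu_ps(cols + j * 8), xj));
        hi = _mm_add_ps(hi, _mm_mul_ps(_mm_loadu_ps(cols + j * 8 + 4), xj));
    }

    _mm_storeu_ps(tmp, lo);
    _mm_storeu_ps(tmp + 4, hi);
    for (i = 0; i < 5; ++i)
        y[i] = tmp[i];
}
```

Because the matrix never changes, the padded column layout can be built once up front, and the 40 floats stay in cache across calls.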

How to compare __m128 types?

Question: __m128 a; __m128 b; How do I code a != b? Which should I use: _mm_cmpneq_ps or _mm_cmpneq_ss? How do I process the result? I can't find adequate docs. Answer 1: You should probably use _mm_cmpneq_ps. However, the interpretation of comparisons is a little different with SIMD code than with scalar code. Do you want to test for any corresponding element not being equal, or for all corresponding elements not being equal? To test the results of the 4 comparisons from _mm_cmpneq_ps you can use _mm_movemask_ps, or _mm_movemask_epi8 after casting the result to __m128i with _mm_castps_si128.
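
The "any" versus "all" distinction from the answer can be sketched as follows. This version uses _mm_movemask_ps, which takes a __m128 directly and returns one bit per float lane. The helper names are hypothetical. Note that a lane holding NaN compares as not equal, even against itself.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE */

/* Nonzero if at least one pair of corresponding lanes differs. */
static int any_lane_differs(__m128 a, __m128 b)
{
    /* _mm_cmpneq_ps sets a lane to all ones where a != b;
       _mm_movemask_ps packs the top bit of each lane into bits 0..3. */
    return _mm_movemask_ps(_mm_cmpneq_ps(a, b)) != 0;
}

/* Nonzero only if all four pairs of corresponding lanes differ. */
static int all_lanes_differ(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpneq_ps(a, b)) == 0xF;
}

/* Small self-check: the vectors differ in the highest lane only. */
static void compare_demo(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(5.0f, 3.0f, 2.0f, 1.0f);

    assert(any_lane_differs(a, b));
    assert(!all_lanes_differ(a, b));
    assert(!any_lane_differs(a, a));
}
```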

The indices of non-zero bytes of an SSE/AVX register

Question: If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all nonzero elements? For example, if the xmm value is | r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 | the result should be something like (1, 3, 5, ... , 15). The result should be placed in another __m128i variable or a char[16] array. If it helps, we can assume that the register's value is such that all bytes are either 0 or some constant nonzero value
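
As a baseline sketch rather than the fastest possible answer: build a 16-bit mask of the nonzero bytes with SSE2, then peel off set bits with a scalar loop. The function name nonzero_byte_indices is hypothetical, and __builtin_ctz is a GCC/Clang builtin. Fully vectorized variants (for example table-driven shuffles) exist but are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Writes the indices of the nonzero bytes of v into out, in
   ascending order, and returns how many indices were written. */
static int nonzero_byte_indices(__m128i v, uint8_t *out)
{
    unsigned zero_mask;
    unsigned mask;
    int n = 0;

    assert(out != NULL);

    /* Bit i of zero_mask is set where byte i equals zero. */
    zero_mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(v, _mm_setzero_si128()));
    /* Invert and keep the low 16 bits: bit i set where byte i != 0. */
    mask = ~zero_mask & 0xFFFFu;

    while (mask != 0) {
        out[n++] = (uint8_t)__builtin_ctz(mask); /* lowest set bit */
        mask &= mask - 1;                        /* clear that bit */
    }
    return n;
}
```

Comparing against zero and inverting means the sketch works for any nonzero byte value, not only 1.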

How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC intrinsic function

Question: I'm writing SSE code for 2D convolution, but the SSE documentation is very sparse. I'm calculating a dot product with _mm_dp_ps and using _mm_extract_ps to get the result, but _mm_extract_ps returns a hex value that represents a float, and I can't figure out how to convert this hex float to a regular float. I could use __builtin_ia32_vec_ext_v4sf, which returns a float, but I want to keep compatibility with other compilers. _mm_extract_ps (__m128 __X, const int __N) { union { int i; float f; } _
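
_mm_extract_ps is declared to return an int that carries the raw bit pattern of the selected float, which is why the value looks like hex. Two portable ways around this are sketched here: reinterpret the bits with memcpy, or skip _mm_extract_ps and read the lane with _mm_cvtss_f32. The helper names are hypothetical, and the example avoids _mm_dp_ps so that it builds with plain SSE.

```c
#include <assert.h>
#include <string.h>
#include <xmmintrin.h> /* SSE */

/* Reinterpret the int returned by _mm_extract_ps as a float. */
static float bits_to_float(int bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f); /* copies the bit pattern unchanged */
    return f;
}

/* Read lane 0 as a float, with no integer detour. */
static float lane0(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Read lane 2: move it into lane 0 first, then convert. */
static float lane2(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
}
```

When the dot product lands in lane 0, lane0 alone is enough and no extract step is needed.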

Efficient SSE NxN matrix multiplication

Question: I'm trying to implement an SSE version of large matrix-by-matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations. My desired method looks like: A(n x m) * B(m x k) = C(n x k), and all matrices are 16-byte-aligned float arrays. I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it to be as efficient as possible, and I don't want to use the Eigen library or similar libraries. (Only SSE3 to be more
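
A starting-point sketch for the A(n x m) * B(m x k) = C(n x k) shape described in the question, not a tuned kernel. It assumes row-major storage and that k is a multiple of 4, and it uses unaligned loads so it works regardless of alignment. The name matmul_sse is hypothetical. Cache blocking and register tiling, which matter for large matrices, are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE */

/* C (n x k) = A (n x m) * B (m x k), all row-major floats.
   Assumption for this sketch: k is a multiple of 4. */
static void matmul_sse(const float *A, const float *B, float *C,
                       size_t n, size_t m, size_t k)
{
    size_t i, j, p;

    assert(A != NULL && B != NULL && C != NULL);
    assert(k % 4 == 0);

    for (i = 0; i < n; ++i) {
        for (j = 0; j < k; j += 4) {
            /* acc holds C[i][j..j+3] while we walk along row i of A. */
            __m128 acc = _mm_setzero_ps();
            for (p = 0; p < m; ++p) {
                __m128 a = _mm_set1_ps(A[i * m + p]);   /* broadcast A[i][p] */
                __m128 b = _mm_loadu_ps(&B[p * k + j]); /* B[p][j..j+3] */
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i * k + j], acc);
        }
    }
}
```

Broadcasting one element of A against four contiguous elements of B avoids transposing B and avoids horizontal additions in the inner loop.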

developing for new instruction sets

Question: Intel is set to release a new instruction set called AVX, which includes an extension of SSE to 256-bit operation: that is, either 4 double-precision elements or 8 single-precision elements. How would one go about developing code for AVX, considering there's no hardware out there that supports it yet? More generally, how can developers write code for hardware that doesn't exist, for instance if they want to have software ready when the supporting CPU is released? Answer 1: Maybe I'm missing

SSE _mm_movemask_epi8 equivalent method for ARM NEON

Question: I decided to continue the Fast corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input? Answer 1: I know this post is quite outdated, but I found it useful to give my (validated) solution. It assumes all ones/all zeroes in every lane of the Input argument. const uint8_t __attribute__ ((aligned (16))) _Powers[16]= { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 }; // Set the powers of 2 (do it once for all, if applicable)
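
The rest of the answer is cut off, so the reduction here is an assumption about how the powers-of-two table might be used, not a reproduction of the answer. The sketch ANDs the input with the table, then sums each 8-byte half with three pairwise widening adds. A scalar reference is included so the expected bit layout is explicit and the file still builds on non-ARM targets. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference for _mm_movemask_epi8:
   bit i of the result is the top bit of byte i. */
static uint16_t movemask_u8_scalar(const uint8_t *bytes)
{
    unsigned mask = 0;
    int i;

    assert(bytes != NULL);
    for (i = 0; i < 16; ++i)
        mask |= (unsigned)(bytes[i] >> 7) << i;
    return (uint16_t)mask;
}

#if defined(__ARM_NEON)
/* NEON sketch; assumes every lane of input is 0x00 or 0xFF. */
static uint16_t movemask_u8_neon(uint8x16_t input)
{
    static const uint8_t powers[16] = {
        1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
    };
    /* Keep one distinct bit per lane within each 8-byte half. */
    uint8x16_t bits = vandq_u8(input, vld1q_u8(powers));
    /* Pairwise widening adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64.
       Each 64-bit lane ends up holding the sum of one half (0..255). */
    uint64x2_t sums = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(bits)));
    return (uint16_t)(vgetq_lane_u64(sums, 0) |
                      (vgetq_lane_u64(sums, 1) << 8));
}
#endif
```

Because each lane contributes a different power of two, adding the lanes of one half is equivalent to OR-ing them, which is what makes the pairwise adds a valid way to assemble the mask.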

SSE: convert short integer to float

Question: I want to convert an array of unsigned short numbers to float using SSE. Let's say __m128i xVal; // Has 8 16-bit unsigned integers __m128 y1, y2; // 2 xmm registers for 8 float values I want the first 4 uint16 values in y1 and the next 4 in y2. I need to know which SSE intrinsics to use. Answer 1: You need to first unpack your vector of 8 x 16-bit unsigned shorts into two vectors of 32-bit unsigned ints, then convert each of these vectors to float: __m128i xlo = _mm_unpacklo_epi16(xVal, _mm_set1_epi16(0)); _
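
The answer is truncated after the first unpack, so the following completes the idea as a sketch under stated assumptions: interleave with zero to widen each uint16 to 32 bits, then convert with _mm_cvtepi32_ps. That intrinsic performs a signed conversion, which is safe here because every widened value is at most 65535. The wrapper u16x8_to_float and its array interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 */

/* Convert 8 unsigned 16-bit integers to 8 floats. */
static void u16x8_to_float(const uint16_t *in, float *out)
{
    __m128i x, zero, lo, hi;

    assert(in != NULL && out != NULL);

    x = _mm_loadu_si128((const __m128i *)in);
    zero = _mm_setzero_si128();

    /* Interleaving with zero places a zero upper half above each
       16-bit value, i.e. zero-extends it to 32 bits. */
    lo = _mm_unpacklo_epi16(x, zero); /* elements 0..3 */
    hi = _mm_unpackhi_epi16(x, zero); /* elements 4..7 */

    _mm_storeu_ps(out, _mm_cvtepi32_ps(lo));     /* y1 in the question */
    _mm_storeu_ps(out + 4, _mm_cvtepi32_ps(hi)); /* y2 in the question */
}
```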