sse

Is my understanding of AoS vs SoA advantages/disadvantages correct?

早过忘川 submitted on 2019-11-28 21:26:57
Question: I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume a greater understanding of processor functionality than I possess. That said, what I do understand about the former topic in particular leads to some questions that I think I should be able to understand the answers to. Firstly, to make sure I am not basing my understanding on a false premise, my understanding of…
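
For reference, a minimal illustration of the two layouts the question is about (the Particle fields here are hypothetical): in AoS each object's fields sit together in memory, while in SoA each field gets its own contiguous array, which tends to suit loops that touch only a few fields and vectorize more easily.

#include <vector>

// Array of Structures (AoS): each particle's fields are stored together.
struct ParticleAoS {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
};
std::vector<ParticleAoS> particles_aos;  // [x y z vx vy vz][x y z vx vy vz]...

// Structure of Arrays (SoA): each field is its own contiguous array, so a
// loop that only touches positions streams through memory without loading
// the unused velocity fields, and maps more naturally onto SIMD lanes.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};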

How to compare __m128 types?

耗尽温柔 submitted on 2019-11-28 20:53:36
__m128 a; __m128 b; How do I code a != b? Which should I use: _mm_cmpneq_ps or _mm_cmpneq_ss? How do I process the result? I can't find adequate docs. You should probably use _mm_cmpneq_ps. However, the interpretation of comparisons is a little different with SIMD code than with scalar code. Do you want to test for any corresponding element not being equal? Or all corresponding elements not being equal? To test the results of the 4 comparisons from _mm_cmpneq_ps you can use _mm_movemask_ps (or _mm_movemask_epi8 after a cast to __m128i). Note that comparing floating point values for equality or inequality is usually a bad idea, except in very…
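
A minimal sketch of the "any element differs" case, assuming a per-element float comparison is really what is wanted (the function name any_not_equal is mine, not part of any header):

#include <xmmintrin.h>  // SSE

// Returns true if any corresponding pair of elements differs.
// For "all elements differ", test mask == 0xF instead of mask != 0.
bool any_not_equal(__m128 a, __m128 b)
{
    __m128 neq  = _mm_cmpneq_ps(a, b);   // per element: all-ones if a[i] != b[i], else 0
    int    mask = _mm_movemask_ps(neq);  // collect the 4 sign bits into bits 0..3
    return mask != 0;
}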

Using SSE instructions

岁酱吖の submitted on 2019-11-28 16:46:34
Question: I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I have heard that if I use SSE instructions for these operations, it will run much faster than a normal loop written using bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions…
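
As a rough illustration of the kind of loop being asked about, here is a sketch using SSE intrinsics; masked_min_max is a hypothetical name, it assumes SSE4.1 is available and that n is a multiple of 4, and a production version would need a scalar tail loop and a runtime CPU-feature check.

#include <smmintrin.h>  // SSE4.1: _mm_min_epi32 / _mm_max_epi32
#include <cstddef>
#include <cstdint>

// AND each element with `mask`, then track the running min and max.
void masked_min_max(const int32_t* data, size_t n, int32_t mask,
                    int32_t& out_min, int32_t& out_max)
{
    __m128i vmask = _mm_set1_epi32(mask);
    __m128i vmin  = _mm_set1_epi32(INT32_MAX);
    __m128i vmax  = _mm_set1_epi32(INT32_MIN);

    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        v    = _mm_and_si128(v, vmask);
        vmin = _mm_min_epi32(vmin, v);
        vmax = _mm_max_epi32(vmax, v);
    }

    // Reduce the 4 lanes with a short scalar pass.
    alignas(16) int32_t mins[4], maxs[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(mins), vmin);
    _mm_store_si128(reinterpret_cast<__m128i*>(maxs), vmax);
    out_min = mins[0];
    out_max = maxs[0];
    for (int k = 1; k < 4; ++k) {
        if (mins[k] < out_min) out_min = mins[k];
        if (maxs[k] > out_max) out_max = maxs[k];
    }
}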

Parallel programming using Haswell architecture [closed]

旧城冷巷雨未停 submitted on 2019-11-28 16:35:39
I want to learn about parallel programming on Intel's Haswell CPU microarchitecture, in particular using SIMD (SSE4.2, AVX2) from asm/C/C++/(any other language). Can you recommend books, tutorials, internet resources, or courses? Thanks! Z boson: It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago, before I had ever used SSE, OpenMP, or intrinsics, so let me give a brief summary of some important concepts I have learned and some useful resources. There are several parallel computing technologies that can be employed: MIMD, SIMD,…

Horizontal sum of 32-bit floats in 256-bit AVX vector [duplicate]

白昼怎懂夜的黑 submitted on 2019-11-28 14:25:15
This question already has an answer here: How to sum __m256 horizontally? (2 answers). I have two arrays of floats and I would like to calculate their dot product, using SSE and AVX, with the lowest latency possible. I am aware there is a 256-bit dot-product intrinsic for floats, but I have read on SO that it is slower than the technique below ( https://stackoverflow.com/a/4121295/997112 ). I have done most of the work; the vector temp_sums contains all the sums, I just need to sum the eight 32-bit floats contained within temp_sums at the end. #include "xmmintrin.h" #include "immintrin.h" int main…
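
One common way to finish the reduction is to fold the upper 128-bit lane onto the lower one and then reduce the remaining __m128; hsum256_ps below is just an illustrative helper name, not an existing intrinsic.

#include <immintrin.h>  // AVX (plus SSE3 for _mm_movehdup_ps)

// Reduce the 8 floats of a __m256 to a single float.
static inline float hsum256_ps(__m256 v)
{
    __m128 lo   = _mm256_castps256_ps128(v);    // lower 4 floats
    __m128 hi   = _mm256_extractf128_ps(v, 1);  // upper 4 floats
    __m128 sum4 = _mm_add_ps(lo, hi);           // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(sum4);        // lanes (1,1,3,3)
    __m128 sum2 = _mm_add_ps(sum4, shuf);       // 2 partial sums
    shuf        = _mm_movehl_ps(shuf, sum2);    // move high pair down
    __m128 sum1 = _mm_add_ss(sum2, shuf);       // final scalar sum in lane 0
    return _mm_cvtss_f32(sum1);
}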

Function crashes when using _mm_load_pd

岁酱吖の submitted on 2019-11-28 14:23:15
I have the following function:

template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
    __m128d a;
    __m128d b;
    __m128d c;
    double A2[2], B2[2], C[2];
    const double * A2ptr, * B2ptr;
    A2ptr = &A2[0];
    B2ptr = &B2[0];
    a = _mm_load_pd(A);
    for(int i = 0; i < size; i+=2)
    {
        std::cout << "In SSE_vectormult: i is: " << i << '\n';
        A2[0] = A[i]; B2[0] = B[i];
        A2[1] = A[i+1]; B2[1] = B[i+1];
        std::cout << "Values from A and B written to A2 and B2\n";
        a = _mm_load_pd(A2ptr);
        b = _mm_load_pd(B2ptr);
        std::cout << "Values converted to a and b\n";
        c = _mm_mul_pd(a,b);
        _mm_store_pd(C, c);
        A[i] = C[0]; A[i…
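
For context, _mm_load_pd and _mm_store_pd require 16-byte-aligned pointers and fault otherwise, which is a common cause of crashes like this one. A sketch of the same element-wise multiply using the unaligned variants (and no staging buffers) might look like the following, assuming size is even; sse_vectormult is my own name for it.

#include <emmintrin.h>  // SSE2

// Multiply A[i] *= B[i] two doubles at a time, with unaligned loads/stores.
void sse_vectormult(double* A, const double* B, int size)
{
    for (int i = 0; i < size; i += 2) {
        __m128d a = _mm_loadu_pd(A + i);  // works for any alignment
        __m128d b = _mm_loadu_pd(B + i);
        __m128d c = _mm_mul_pd(a, b);
        _mm_storeu_pd(A + i, c);          // store the product back into A
    }
}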

Convention for displaying vector registers

这一生的挚爱 submitted on 2019-11-28 14:19:24
Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set? For example, if you have 1 in the least significant byte, 20 in the most significant byte, and 0 elsewhere in an xmm register, for a byte-wise display is the following preferred (little-endian): [1, 0, 0, 0, ..., 0, 20] or is this preferred: [20, 0, 0, 0, ..., 0, 1]? Similarly, when displaying such registers as made up of larger data items, is the same rule applied? E.g., to display the register as DWORDs, I assume each DWORD is still written in the usual (big-endian) way,…
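
A small helper makes the two conventions concrete: printing byte 0 first puts the least significant byte on the left, while printing byte 15 first matches the way Intel's manuals draw registers, with the most significant byte on the left. dump_bytes is only an illustrative name.

#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cstdio>

// Print the 16 bytes of an __m128i in either order.
void dump_bytes(__m128i v, bool msb_first)
{
    uint8_t b[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(b), v);
    for (int i = 0; i < 16; ++i)
        std::printf("%u ", b[msb_first ? 15 - i : i]);
    std::printf("\n");
}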

Can XMM registers be used to do any 128 bit integer math? [duplicate]

流过昼夜 submitted on 2019-11-28 14:04:01
This question already has an answer here: Is it possible to use SSE and SSE2 to make a 128-bit wide integer? (1 answer). My impression is definitely not, but perhaps there is a clever trick? Thanks. Not directly, but there are 64-bit arithmetic operations which can easily be combined to give 128-bit (or greater) precision. The xmm registers can do arithmetic on 8-, 16-, 32- and 64-bit integers, but this doesn't produce a carry flag, so you can't extend the precision beyond 64 bits that way. The extended-precision math libraries use the general-purpose registers, which are 32-bit or 64-bit depending on the OS.
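
As a sketch of what "combining 64-bit operations" means in practice, here is 128-bit addition built from two 64-bit halves with a manually propagated carry (u128 and add128 are hypothetical names, not a library type):

#include <cstdint>

// A 128-bit unsigned value as two 64-bit halves.
struct u128 { uint64_t lo, hi; };

u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo) ? 1 : 0;  // unsigned wrap-around signals a carry
    r.hi = a.hi + b.hi + carry;
    return r;
}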

The indices of non-zero bytes of an SSE/AVX register

随声附和 submitted on 2019-11-28 13:34:54
If an SSE/AVX register's value is such that all of its bytes are either 0 or 1, is there any way to efficiently get the indices of all the non-zero elements? For example, if the xmm value is | r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 | the result should be something like (1, 3, 5, ..., 15). The result should be placed in another __m128i variable or a char[16] array. If it helps, we can assume that the register's value is such that all bytes are either 0 or some constant non-zero value (not necessarily 1). I am mostly wondering whether there is an instruction for that, or preferably C/C++…
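
One straightforward approach, sketched below, is to compare the bytes against zero, collapse the comparison into a 16-bit mask with _mm_movemask_epi8, and then walk the set bits; a fully branchless version could instead use that mask to look up a pshufb shuffle-control table. nonzero_indices is an illustrative name.

#include <emmintrin.h>  // SSE2
#include <cstdint>

// Writes the indices of the non-zero bytes of v into out[0..count) and
// returns count.
int nonzero_indices(__m128i v, uint8_t out[16])
{
    __m128i zero    = _mm_setzero_si128();
    __m128i is_zero = _mm_cmpeq_epi8(v, zero);               // 0xFF where byte == 0
    int     mask    = ~_mm_movemask_epi8(is_zero) & 0xFFFF;  // bit i set where byte i != 0

    int count = 0;
    for (int i = 0; i < 16; ++i)
        if (mask & (1 << i))
            out[count++] = static_cast<uint8_t>(i);
    return count;
}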

How to convert a hex float to a float in C/C++ using the _mm_extract_ps SSE GCC intrinsic function

泪湿孤枕 submitted on 2019-11-28 12:22:29
I'm writing SSE code for a 2D convolution, but the SSE documentation is very sparse. I'm calculating a dot product with _mm_dp_ps and using _mm_extract_ps to get the dot-product result, but _mm_extract_ps returns a hex value that represents a float, and I can't figure out how to convert this hex float to a regular float. I could use __builtin_ia32_vec_ext_v4sf, which returns a float, but I want to keep compatibility with other compilers. _mm_extract_ps (__m128 __X, const int __N) { union { int i; float f; } __tmp; __tmp.f = __builtin_ia32_vec_ext_v4sf ((__v4sf)__X, __N); return __tmp.i; } What am I missing?
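
_mm_extract_ps returns the raw bit pattern of the selected lane as an int, so that int has to be reinterpreted as a float; and when _mm_dp_ps is used with a mask such as 0xF1, the result already sits in lane 0, where _mm_cvtss_f32 can read it directly. A sketch of both options (the function names are mine):

#include <smmintrin.h>  // SSE4.1: _mm_extract_ps, _mm_dp_ps
#include <cstring>

// Reinterpret the int bit pattern returned by _mm_extract_ps as a float.
// memcpy avoids strict-aliasing problems; the lane index must be a
// compile-time constant.
float extract_lane_bits(__m128 v)
{
    int   bits = _mm_extract_ps(v, 0);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Simpler when lane 0 is all that is needed (as after _mm_dp_ps with 0xF1):
float extract_lane0(__m128 v)
{
    return _mm_cvtss_f32(v);  // reads the lowest float directly
}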