sse

Error : casting user defined data types in c

那年仲夏 posted on 2019-11-28 09:37:23
Question: This is a simplified view of my problem. I want to convert a float value into a user-defined vector type, v4si, so that I can use SIMD operations for optimization. Please help me convert a float/double value to that type.

    #include <stdio.h>
    typedef double v4si __attribute__ ((vector_size (16)));
    int main() {
        double stoptime = 36000;
        float x = 0.5 * stoptime;
        float *temp = &x;
        v4si a = ((v4si)x); // Error: Incompatible data types
        v4si b;
        v4si *c;
        c = ((v4si*)&temp); // Copies address of temp,
        b = *(c);
        printf("%f\n"
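A minimal sketch of one way to get a scalar into a GCC/Clang vector type, assuming the vector extension from the question is available. A 16-byte vector of double holds two lanes, so the hypothetical type name v2df is used here instead of v4si. The helper names broadcast_double and store_lanes are illustrative, not part of any library. A scalar cannot be cast to a vector type directly, but a brace initializer can fill every lane:

```c
/* Two doubles packed into one 16-byte vector (GCC/Clang vector extension).
   The name v2df is an assumption for this sketch. */
typedef double v2df __attribute__((vector_size(16)));

/* Broadcast one scalar into both lanes with a brace initializer;
   a plain (v2df)x cast from a scalar is rejected by the compiler. */
v2df broadcast_double(double x)
{
    v2df v = {x, x};
    return v;
}

/* Copy the lanes out by subscripting so scalar code can read them. */
void store_lanes(v2df v, double out[2])
{
    out[0] = v[0];
    out[1] = v[1];
}
```

Once both operands are vectors, the usual arithmetic operators (for example v * v) apply lane by lane.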

SSE: convert short integer to float

元气小坏坏 posted on 2019-11-28 09:15:39
I want to convert an array of unsigned short numbers to float using SSE. Let's say

    __m128i xVal;    // Has 8 16-bit unsigned integers
    __m128 y1, y2;   // 2 xmm registers for 8 float values

I want the first 4 uint16 values in y1 and the next 4 in y2. I need to know which SSE intrinsic to use.

You need to first unpack your vector of 8 x 16-bit unsigned shorts into two vectors of 32-bit unsigned ints, then convert each of these vectors to float:

    __m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0));
    __m128i xhi = _mm_unpackhi_epi16(x, _mm_set1_epi16(0));
    __m128 ylo = _mm_cvtepi32_ps(xlo);
    __m128 yhi = _mm
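The unpack-then-convert approach described above can be sketched as a complete function. The wrapper name u16x8_to_float and its array-based interface are assumptions made for illustration. Only SSE2 intrinsics are used, and _mm_setzero_si128 stands in for _mm_set1_epi16(0):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Convert 8 unsigned 16-bit values to 8 floats. */
void u16x8_to_float(const uint16_t in[8], float out[8])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i x = _mm_loadu_si128((const __m128i *)in);

    /* Interleaving with zero widens each 16-bit value to 32 bits
       (zero extension), so values up to 65535 stay positive. */
    __m128i xlo = _mm_unpacklo_epi16(x, zero);  /* elements 0..3 */
    __m128i xhi = _mm_unpackhi_epi16(x, zero);  /* elements 4..7 */

    /* Signed 32-bit to float conversion is safe after zero extension. */
    _mm_storeu_ps(out,     _mm_cvtepi32_ps(xlo));
    _mm_storeu_ps(out + 4, _mm_cvtepi32_ps(xhi));
}
```

Unaligned loads and stores are used so the sketch works with any caller buffers; aligned variants could be substituted when the data is known to be 16-byte aligned.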

SIMD math libraries for SSE and AVX

牧云@^-^@ posted on 2019-11-28 08:26:06
I am looking for SIMD math libraries (preferably open source) for SSE and AVX. For example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once. AMD has a proprietary library, LibM http://developer.amd.com/tools/cpu-development/libm/ which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It gives better performance than the standard math libraries on Intel CPUs, but it's not much better. Intel

SSE instructions to add all elements of an array [duplicate]

坚强是说给别人听的谎言 posted on 2019-11-28 07:48:31
This question already has an answer here: Sum reduction of unsigned bytes without overflow, using SSE2 on Intel (2 answers)

I am new to SSE2 instructions. I have found an instruction, _mm_add_epi8, which can add two array elements. But I want an SSE instruction which can add all elements of an array. I was trying to develop this concept using this code:

    #include <iostream>
    #include <conio.h>
    #include <emmintrin.h>
    void sse(unsigned char* a, unsigned char* b);
    void main() {
        /*unsigned char *arr; arr=(unsigned char *)malloc(50);*/
        unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n',
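No single SSE2 instruction sums a whole array, but a common technique, sketched here, is a sum of absolute differences against zero: _mm_sad_epu8 adds each group of 8 unsigned bytes into a 64-bit lane, so the running total cannot overflow the way _mm_add_epi8 would. The function name sum_bytes_sse2 is hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sum n unsigned bytes without overflow. */
uint64_t sum_bytes_sse2(const unsigned char *p, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  /* two 64-bit partial sums */
    size_t i = 0;

    /* |v - 0| summed over each 8-byte half gives that half's byte sum. */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    /* Combine the two 64-bit lanes. */
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint64_t total = lanes[0] + lanes[1];

    /* Handle the remaining 0..15 bytes with scalar code. */
    for (; i < n; ++i)
        total += p[i];
    return total;
}
```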

SSE multiplication of 2 64-bit integers

一曲冷凌霜 posted on 2019-11-28 07:36:41
Question: How can I multiply two 64-bit integers by another two 64-bit integers? I didn't find any instruction which can do it.

Answer 1: I know this is an old question, but I was looking for exactly this. As there's still no instruction for it, I implemented the 64-bit multiply myself with pmuldq, as Paul R mentioned. This is what I came up with:

    // requires g++ -msse4.1 ...
    #include <emmintrin.h>
    #include <smmintrin.h>
    __m128i Multiply64Bit(__m128i a, __m128i b) {
        auto ax0_ax1_ay0_ay1 = a;
        auto bx0
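The answer's code is cut off above, so the following is a separate sketch rather than a reconstruction of it. It computes the low 64 bits of each product using only SSE2, building on _mm_mul_epu32 (PMULUDQ, the unsigned 32x32 to 64-bit multiply) instead of the SSE4.1 instruction the answer names. With a = ah*2^32 + al and b = bh*2^32 + bl, the low 64 bits of a*b equal al*bl + ((ah*bl + al*bh) << 32). The names mul64_lo and mul64x2 are hypothetical:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Low 64 bits of a*b for each of the two 64-bit lanes. */
__m128i mul64_lo(__m128i a, __m128i b)
{
    __m128i a_hi = _mm_srli_epi64(a, 32);     /* ah in the low half */
    __m128i b_hi = _mm_srli_epi64(b, 32);     /* bh in the low half */

    /* _mm_mul_epu32 multiplies the low 32 bits of each 64-bit lane. */
    __m128i lo_lo = _mm_mul_epu32(a, b);      /* al * bl */
    __m128i hi_lo = _mm_mul_epu32(a_hi, b);   /* ah * bl */
    __m128i lo_hi = _mm_mul_epu32(a, b_hi);   /* al * bh */

    /* The cross terms only contribute to the upper 32 bits. */
    __m128i cross = _mm_slli_epi64(_mm_add_epi64(hi_lo, lo_hi), 32);
    return _mm_add_epi64(lo_lo, cross);
}

/* Array-based wrapper so scalar code can call the vector routine. */
void mul64x2(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul64_lo(va, vb));
}
```

Because only the low 64 bits are kept, the result is the same for signed and unsigned inputs, matching ordinary wrapping multiplication of uint64_t values.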

Does x86-SSE-instructions have an automatic release-acquire order?

夙愿已清 posted on 2019-11-28 07:30:13
Question: As we know from the C11 memory_order documentation: http://en.cppreference.com/w/c/atomic/memory_order and the same for C++11 std::memory_order: http://en.cppreference.com/w/cpp/atomic/memory_order

On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store
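To make the quoted statement concrete, here is a minimal C11 sketch of release-acquire publication. The names payload, ready, publish and try_consume are hypothetical. On x86, compilers typically emit ordinary MOV instructions for both the release store and the acquire load; the memory_order arguments mainly restrict how the compiler may reorder the surrounding accesses. This covers normal stores only and makes no claim about non-temporal SSE stores:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* ordinary, non-atomic data */
static atomic_bool ready;  /* flag that publishes the data */

/* Writer: fill the data, then publish it with a release store.
   The compiler may not move the payload write below this store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: an acquire load that sees ready == true also sees payload. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

In real use publish and try_consume would run on different threads; they are shown as plain functions to keep the sketch short.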

Crash after m = XMMatrixIdentity() - alignment memory in classes?

£可爱£侵袭症+ posted on 2019-11-28 07:20:52
Question: I was looking at the tutorials in the DirectX SDK. Tutorial 5 works fine, but after I copied the code and separated it into my own classes, I got a strange error when launching my application. The line is:

    g_World1 = XMMatrixIdentity();

Because of it, I got an error in the xnamathmatrix.inl operator=, which looks like this:

    XMFINLINE _XMMATRIX& _XMMATRIX::operator= ( CONST _XMMATRIX& M )
    {
        r[0] = M.r[0];
        r[1] = M.r[1];
        r[2] = M.r[2];
        r[3] = M.r[3];
        return *this;
    }

And the error message is: Access
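The error message is cut off above, so the cause is not confirmed here. Going by the title, one plausible explanation is that a heap-allocated object containing SSE vectors was not 16-byte aligned. The sketch below illustrates that assumption in plain C rather than with the DirectX types: the struct matrix4 and the helper names are hypothetical stand-ins, and _mm_malloc/_mm_free are used to request 16-byte alignment explicitly:

```c
#include <xmmintrin.h>  /* SSE intrinsics, _mm_malloc, _mm_free */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 4x4 matrix stored as four SSE rows. */
typedef struct {
    __m128 r[4];
} matrix4;

/* Returns nonzero when p is 16-byte aligned. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Allocate with explicit 16-byte alignment; plain malloc may return
   memory that is only 8-byte aligned on some 32-bit targets. */
matrix4 *matrix4_new_identity(void)
{
    matrix4 *m = (matrix4 *)_mm_malloc(sizeof(matrix4), 16);
    if (m == NULL)
        return NULL;
    m->r[0] = _mm_setr_ps(1.0f, 0.0f, 0.0f, 0.0f);
    m->r[1] = _mm_setr_ps(0.0f, 1.0f, 0.0f, 0.0f);
    m->r[2] = _mm_setr_ps(0.0f, 0.0f, 1.0f, 0.0f);
    m->r[3] = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);
    return m;
}

/* Memory from _mm_malloc must be released with _mm_free. */
void matrix4_free(matrix4 *m)
{
    _mm_free(m);
}
```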

Fast 24-bit array -> 32-bit array conversion?

女生的网名这么多〃 posted on 2019-11-28 06:58:55
Quick Summary: I have an array of 24-bit values. Any suggestions on how to quickly expand the individual 24-bit array elements into 32-bit elements?

Details: I'm processing incoming video frames in real time using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32 bits before I can load them into the GPU. I really don't care what I set the remaining 8 bits to, or where the incoming 24 bits are in that 32-bit
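As a baseline for the expansion described above, here is a portable scalar sketch; it is a reference implementation, not a tuned one. The function name expand_24_to_32 is hypothetical, and the choice to place the three source bytes first and zero the fourth is an assumption, which the question says is acceptable. A vectorized version would commonly use a byte shuffle such as SSSE3 _mm_shuffle_epi8, which is not shown here:

```c
#include <stddef.h>
#include <stdint.h>

/* Expand n_pixels packed 3-byte pixels into 4-byte pixels.
   dst must hold 4 * n_pixels bytes; the padding byte is set to 0. */
void expand_24_to_32(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0;  /* unused padding byte */
    }
}
```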

Fast counting the number of set bits in __m128i register

a 夏天 posted on 2019-11-28 06:35:28
I need to count the number of set bits in a __m128i register. In particular, I need to write two functions that count the bits of the register in the following ways:

1. The total number of set bits in the register.
2. The number of set bits in each byte of the register.

Are there intrinsic functions that can perform, wholly or partially, the above operations?

Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below computes the number of bits set in each byte. SSE2-only version (based on Algorithm 3 in Hacker's Delight
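The answer's code is cut off above, so the following is an independent SSE2 sketch of the classic bit-parallel population count, not the author's listing. It reuses the name popcnt8 from the text for the per-byte count; popcnt128 is a hypothetical name for the total, obtained by summing the byte counts with _mm_sad_epu8:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-byte popcount: each output byte holds the number of set bits
   (0..8) of the corresponding input byte. */
__m128i popcnt8(__m128i x)
{
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);

    /* SSE2 has no 8-bit shift; the 16-bit shifts leak bits across
       byte boundaries, and the masks discard exactly those bits. */
    x = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi16(x, 1), m1));
    x = _mm_add_epi8(_mm_and_si128(x, m2),
                     _mm_and_si128(_mm_srli_epi16(x, 2), m2));
    x = _mm_and_si128(_mm_add_epi8(x, _mm_srli_epi16(x, 4)), m4);
    return x;
}

/* Total popcount of the 128-bit register. */
int popcnt128(__m128i x)
{
    /* Sum the 16 byte counts into two 64-bit lanes, then add the lanes. */
    __m128i s = _mm_sad_epu8(popcnt8(x), _mm_setzero_si128());
    return _mm_cvtsi128_si32(s) + _mm_cvtsi128_si32(_mm_srli_si128(s, 8));
}
```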

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

六月ゝ 毕业季﹏ posted on 2019-11-28 06:11:00
Question: I am somewhat confused by the MOVSD assembly instruction. I wrote some numerical code that computes a matrix multiplication, using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that: 1) 128-bit vector registers XMM are used; 2) the SSE2 instruction MOVSD is invoked. I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64
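A small sketch of the scalar behaviour described above. The function names are hypothetical. _mm_load_sd is the intrinsic counterpart of the MOVSD load form: it places one double in the low 64 bits of an XMM register and zeroes the upper 64 bits. The second function is ordinary C; on x86-64, compilers typically implement such code with scalar SSE2 instructions (for example MOVSD, MULSD, ADDSD) even though no intrinsics header is included, because SSE2 is the baseline for floating point on that target:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Load one double into the low lane; the upper lane is zeroed.
   Both lanes are stored so the caller can inspect them. */
void load_scalar_double(const double *src, double out[2])
{
    __m128d v = _mm_load_sd(src);
    _mm_storeu_pd(out, v);
}

/* Plain scalar C: a compiler targeting x86-64 will usually keep
   a, x and y in XMM registers and use only their low 64 bits. */
double scalar_madd(double a, double x, double y)
{
    return a * x + y;
}
```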