sse

How to specify alignment with _mm_mul_ps

回眸只為那壹抹淺笑 posted on 2019-12-11 02:19:09
Question: I am using an SSE intrinsic with a memory location as one of the arguments ( _mm_mul_ps(xmm1,mem) ). I am unsure which will be faster:

    xmm1 = _mm_mul_ps(xmm0,mem) // mem is 16 byte aligned

or:

    xmm0 = _mm_load_ps(mem);
    xmm1 = _mm_mul_ps(xmm1,xmm0);

Is there a way to specify alignment with the _mm_mul_ps() intrinsic?

Answer 1: There is no _mm_mul_ps(reg,mem) form, even though the mulps reg,mem instruction form exists - https://msdn.microsoft.com/en-us/library/22kbk6t9(v=vs.90).aspx What you can do is _mm
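A minimal sketch of the load-then-multiply pattern from the question, assuming GCC, Clang or MSVC on x86. The helper name mul4_aligned is hypothetical and not part of any library. The alignment is expressed through the choice of load intrinsic (_mm_load_ps versus _mm_loadu_ps), not through _mm_mul_ps itself; an optimizing compiler is generally free to fold the aligned load into the memory operand of mulps, so the two spellings in the question often compile to the same code.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_mul_ps, _mm_store_ps */

/* Hypothetical helper: multiplies the four floats at `v` by the four floats
   at `mem` and writes the products to `out`.
   `mem` and `out` must be 16-byte aligned; `v` may have any alignment. */
static void mul4_aligned(const float *v, const float *mem, float *out)
{
    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses. */
    assert(((uintptr_t)mem % 16) == 0 && ((uintptr_t)out % 16) == 0);

    __m128 a = _mm_loadu_ps(v);   /* unaligned load: no assumption about v */
    __m128 b = _mm_load_ps(mem);  /* aligned load: this is where alignment is stated */
    _mm_store_ps(out, _mm_mul_ps(a, b));
}
```

Which form is faster depends on the compiler and the target CPU, so it is worth inspecting the generated assembly before drawing conclusions.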

How to determine CPE: Cycles Per Element

为君一笑 posted on 2019-12-11 01:17:03
Question: How do I determine the CPE of a program? For example, I have this assembly code for a loop:

    # inner4: data_t = float
    # udata in %rbx, vdata in %rax, limit in %rcx,
    # i in %rdx, sum in %xmm1
    .L87:                            # loop:
      movss (%rbx,%rdx,4), %xmm0     # Get udata[i]
      mulss (%rax,%rdx,4), %xmm0     # Multiply by vdata[i]
      addss %xmm0, %xmm1             # Add to sum
      addq $1, %rdx                  # Increment i
      cmpq %rcx, %rdx                # Compare i:limit
      jl .L87                        # If <, goto loop

I have to find the lower bound of the CPE determined by the

My SSE implementation of lookAt doesn't work

99封情书 posted on 2019-12-11 00:39:48
Question: So, I'm writing a math library using SSE intrinsics to use with my OpenGL application. Right now I'm implementing some of the more important functions like lookAt, using the glm library to check for correctness, but for some reason my implementation of lookAt isn't working as it should. Here's the source code:

    inline void lookAt(__m128 position, __m128 target, __m128 up) {
        /* Get the target vector relative to the camera position */
        __m128 t = vec4::normalize3(_mm_sub_ps(target, position));
        _

C++ SIMD: Store uint64_t value after bitwise and operation

微笑、不失礼 posted on 2019-12-11 00:07:37
Question: I am trying to do a bitwise & between elements of two arrays of uint64_t integers and then store the result in another array. This is my program:

    #include <emmintrin.h>
    #include <nmmintrin.h>
    #include <chrono>
    int main() {
        uint64_t data[200];
        uint64_t data2[200];
        uint64_t data3[200];
        __m128i* ptr = (__m128i*) data;
        __m128i* ptr2 = (__m128i*) data2;
        uint64_t* ptr3 = data3;
        for (int i = 0; i < 100; ++i, ++ptr, ++ptr2, ptr3 += 2)
            _mm_store_ps(ptr3, _mm_and_si128(*ptr, *ptr2));
    }

However, I get
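The error message is cut off in this excerpt, but one visible mismatch is that _mm_store_ps expects a float* and a __m128, while the program passes a uint64_t* and the __m128i returned by _mm_and_si128. A hedged sketch of the same loop using the integer load and store intrinsics follows; the function name and_u64 is hypothetical, and the unaligned variants are used because a plain uint64_t array is not guaranteed to be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_and_si128, _mm_storeu_si128 */

/* Hypothetical helper: out[i] = a[i] & b[i] for i in [0, n), two 64-bit lanes at a time.
   n must be even in this sketch; a scalar tail loop would handle odd n. */
static void and_u64(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* Unaligned loads: uint64_t arrays are only guaranteed 8-byte alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Integer store that matches the __m128i result type. */
        _mm_storeu_si128((__m128i *)(out + i), _mm_and_si128(va, vb));
    }
}
```

If the arrays are declared with 16-byte alignment, _mm_load_si128 and _mm_store_si128 can be substituted for the unaligned variants.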

Does _control87() also set the SSE MXCSR Control Register?

元气小坏坏 posted on 2019-12-10 21:23:13
Question: The documentation for _control87 notes: _control87 [...] affect[s] the control words for both the x87 and the SSE2, if present. It seems that the SSE and SSE2 MXCSR control registers are identical; however, there is no mention of the SSE unit in the documentation. Does _control87 affect an SSE unit's MXCSR control register, or is this only true for SSE2?

Answer 1: I dug out an old Pentium III and checked with the following code:

    #include <Windows.h>
    #include <float.h>
    #include <xmmintrin.h>
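The answer's test code is cut off here. As an independent illustration only (not the answer's code), MXCSR can be read and written directly from C with _mm_getcsr and _mm_setcsr from xmmintrin.h, which is one way to observe whether some other call changed it. The helper name set_sse_rounding is hypothetical, and the MSVC-specific _control87 call is deliberately left out so the sketch stays portable.

```c
#include <assert.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* constants */

/* Hypothetical helper: replaces the rounding-control bits of MXCSR with `mode`
   (one of the _MM_ROUND_* constants) and returns the previous MXCSR value so
   the caller can restore it or compare it against a later reading. */
static unsigned int set_sse_rounding(unsigned int mode)
{
    /* Only the two rounding-control bits may be set in `mode`. */
    assert((mode & ~(unsigned int)_MM_ROUND_MASK) == 0);

    unsigned int old = _mm_getcsr();
    _mm_setcsr((old & ~(unsigned int)_MM_ROUND_MASK) | mode);
    return old;
}
```

Reading _mm_getcsr() before and after a call such as _control87 and comparing the two values would show whether that call touched the SSE control register on a given machine.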

tbb::cache_aligned_allocator: Getting “request for member…which is of non-class type” with __m128i. User error or bug?

你说的曾经没有我的故事 posted on 2019-12-10 18:57:01
Question: I'm trying to use __m128i as the value type of a cache-aligned vector with GCC, and I'm getting the following error:

    /usr/include/tbb/cache_aligned_allocator.h:105:32: error: request for member ‘~tbb::cache_aligned_allocator<__vector(2) long long int>::value_type’ in ‘* p’, which is of non-class type ‘tbb::cache_aligned_allocator<__vector(2) long long int>::value_type {aka __vector(2) long long int}’

The compiler traces it to the following line in tbb/cache_aligned_allocator.h:

    void destroy(

How to add SIMD-related compiler flags in visual studio 2010

核能气质少年 posted on 2019-12-10 18:29:27
Question: I found this list of flags: http://www.ncsa.illinois.edu/UserInfo/Resources/Software/Intel/Compilers/10.0/main_for/mergedProjects/optaps_for/common/optaps_dsp_targ.htm and I'd like to try adding some of them to my project. I can't seem to find a way to do it in Visual Studio 2010 :( Does anyone know how to do it? Thanks!!!

Answer 1: The /arch flag in Visual Studio allows you to specify the target processor architecture, and includes support for SSE2, amongst others. This MSDN page

Modulo 2*Pi using SSE/SSE2

坚强是说给别人听的谎言 posted on 2019-12-10 18:09:41
Question: I'm still pretty new to using SSE and am trying to implement a modulo of 2*Pi for double-precision inputs of the order 1e8 (the result of which will be fed into some vectorised trig calculations). My current attempt at the code is based around the idea that mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi and looks like:

    #define _PD_CONST(Name, Val) \
      static const double _pd_##Name[2] __attribute__((aligned(16))) = { Val, Val }
    _PD_CONST(2Pi, 6.283185307179586); /* = 2*pi */
    _PD_CONST(recip_2Pi, 0
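The poster's code is cut off, so the following is only a sketch of the stated identity mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi using SSE2 alone, not a reconstruction of the original. SSE2 has no floor instruction, so this version assumes non-negative inputs whose quotient fits in a 32-bit integer (true for values of the order 1e8), where truncation via _mm_cvttpd_epi32 equals floor. The function name mod_2pi_sse2 and the reciprocal constant are introduced here for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_mul_pd, _mm_cvttpd_epi32, _mm_cvtepi32_pd */

/* Hypothetical helper: x mod 2*pi for two doubles at once.
   Assumes every lane is >= 0 and x/(2*pi) fits in a signed 32-bit integer. */
static __m128d mod_2pi_sse2(__m128d x)
{
    const __m128d two_pi       = _mm_set1_pd(6.283185307179586);
    const __m128d recip_two_pi = _mm_set1_pd(0.15915494309189535); /* 1/(2*pi), value chosen for this sketch */

    /* Precondition check: no lane may be negative, otherwise truncation != floor. */
    assert(_mm_movemask_pd(_mm_cmplt_pd(x, _mm_setzero_pd())) == 0);

    __m128d q = _mm_mul_pd(x, recip_two_pi);            /* x / (2*pi) */
    __m128d k = _mm_cvtepi32_pd(_mm_cvttpd_epi32(q));   /* truncate toward zero, back to double */
    return _mm_sub_pd(x, _mm_mul_pd(k, two_pi));        /* x - k * 2*pi */
}
```

For inputs near 1e8 the product k * 2*pi is rounded to a double of that magnitude, so the result carries an absolute error of roughly 1e-8, and inputs very close to a multiple of 2*Pi can land slightly outside [0, 2*Pi). Whether that is acceptable depends on the trig routines that consume the result.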

Calculate 4d vectors average with SSE

大憨熊 posted on 2019-12-10 17:52:49
Question: I am trying to speed up the calculation of the average of 4d vectors stored in an array. Here is my code:

    #include <sys/time.h>
    #include <sys/param.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <xmmintrin.h>
    typedef float dot[4];
    #define N 1000000
    double gettime () {
        struct timeval tv;
        gettimeofday (&tv, 0);
        return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
    }
    void calc_avg1 (dot res, const dot array[], int n) {
        int i,j;
        memset (res, 0, sizeof (dot));
        for (i = 0; i < n; i+
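The poster's SSE version is not visible in this excerpt. One straightforward way to vectorise the average, offered as a sketch and not as the poster's implementation, is to treat each 4d vector as one __m128, accumulate with _mm_add_ps, and divide once at the end. The function name calc_avg_sse is hypothetical; the dot typedef is taken from the question.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_div_ps, _mm_storeu_ps */

typedef float dot[4];

/* Hypothetical helper: res = component-wise mean of array[0..n-1].
   Unaligned loads and stores are used because a plain float[4] array is
   not guaranteed to be 16-byte aligned. */
static void calc_avg_sse(dot res, const dot array[], int n)
{
    assert(n > 0);  /* the mean of zero vectors is undefined */

    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; ++i)
        sum = _mm_add_ps(sum, _mm_loadu_ps(array[i]));  /* add all four components at once */

    _mm_storeu_ps(res, _mm_div_ps(sum, _mm_set1_ps((float)n)));
}
```

With a million elements, single-precision accumulation can lose accuracy, and the loop is likely limited by memory bandwidth, so the measured speed-up over the scalar version may be modest.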

Get SSE version without __asm on x64

瘦欲@ posted on 2019-12-10 17:13:25
Question: I'm trying to build slightly modified versions of some functions of the VS2010 CRT library. All is well except for the parts where it tries to access a global variable which presumably holds the instruction set architecture (ISA) version:

    if (__isa_available > __ISA_AVAILABLE_SSE2) {
        // ...
    } else if (__isa_available == __ISA_AVAILABLE_SSE2) {
        // ...
    }

I found the values it should hold in an assembly file:

    __ISA_AVAILABLE_X86 equ 0
    __ISA_AVAILABLE_SSE2 equ 1
    __ISA_AVAILABLE_SSE42 equ 2
    __ISA
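One way to query the supported SSE level without inline assembly is the compiler's CPUID intrinsic: __cpuid from intrin.h on MSVC, or __get_cpuid from cpuid.h on GCC and Clang. The sketch below reads the feature bits of CPUID leaf 1. The enum sse_level and the function detect_sse_level are hypothetical names for this example, and their values are not the CRT's __ISA_AVAILABLE_* constants.

```c
#include <assert.h>
#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid */
#else
#include <cpuid.h>    /* __get_cpuid (GCC/Clang) */
#endif

/* Hypothetical levels for this sketch; unrelated to the CRT's __ISA_AVAILABLE_* values. */
enum sse_level { SSE_NONE = 0, SSE_1, SSE_2, SSE_3, SSSE_3, SSE_41, SSE_42 };

/* Returns the highest SSE level reported by CPUID leaf 1. */
static enum sse_level detect_sse_level(void)
{
    unsigned int ecx = 0, edx = 0;
    enum sse_level level = SSE_NONE;

#if defined(_MSC_VER)
    int regs[4];                       /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    ecx = (unsigned int)regs[2];
    edx = (unsigned int)regs[3];
#else
    unsigned int eax = 0, ebx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return SSE_NONE;               /* leaf 1 not supported */
#endif

    if      (ecx & (1u << 20)) level = SSE_42;  /* ECX bit 20: SSE4.2 */
    else if (ecx & (1u << 19)) level = SSE_41;  /* ECX bit 19: SSE4.1 */
    else if (ecx & (1u << 9))  level = SSSE_3;  /* ECX bit 9:  SSSE3  */
    else if (ecx & (1u << 0))  level = SSE_3;   /* ECX bit 0:  SSE3   */
    else if (edx & (1u << 26)) level = SSE_2;   /* EDX bit 26: SSE2   */
    else if (edx & (1u << 25)) level = SSE_1;   /* EDX bit 25: SSE    */

#if defined(__x86_64__) || defined(_M_X64)
    assert(level >= SSE_2);            /* SSE2 is part of the x86-64 baseline */
#endif
    return level;
}
```

A replacement for the CRT's global could be initialised once from such a function at startup; how its values should map onto the truncated __ISA_AVAILABLE_* list is left open here.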