SSE

acos(double) gives different results on x64 and x32 Visual Studio

孤街醉人 submitted on 2019-12-02 03:24:18
acos(double) gives different results on x64 and x32 Visual Studio. printf("%.30g\n", double(acosl(0.49990774364240564))); printf("%.30g\n", acos(0.49990774364240564)); On x64: 1.0473040763868076. On x32: 1.0473040763868078. On Linux 4.4, x32 and x64 with SSE enabled: 1.0473040763868078. Is there a way to make VS x64 acos() give me 1.0473040763868078 as the result? TL;DR: this is normal and you can't reasonably change it. The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate
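
For reference, a minimal compilable version of the two printf lines from the question (a sketch; note that on MSVC long double is the same 64-bit type as double, so acosl may behave exactly like acos there):

    /* repro sketch: print both acos results with full precision */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x = 0.49990774364240564;
        printf("%.30g\n", (double)acosl(x)); /* long double acos, rounded to double */
        printf("%.30g\n", acos(x));          /* plain double acos */
        return 0;
    }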

SSE half loads (_mm_loadh_pi / _mm_loadl_pi) issue warnings

情到浓时终转凉″ submitted on 2019-12-02 02:38:13
I have borrowed a matrix inversion algorithm from the Intel website: http://download.intel.com/design/PentiumIII/sml/24504301.pdf It uses _mm_loadh_pi and _mm_loadl_pi to load the 4x4 matrix coefficients and do a partial shuffle at the same time. The performance improvement in my app is significant; if I do a classic load/shuffle of the matrix using _mm_load_ps, it's slightly slower. But this load approach issues a compilation warning: "tmp1 is used uninitialized in this function" __m128 tmp1; tmp1 = _mm_loadh_pi(_mm_loadl_pi(tmp1, (__m64*)(src)), (__m64*)(src+ 4)); Which makes sense in a
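
A common way to silence the warning is to give tmp1 a defined starting value. A minimal sketch, assuming the SSE headers: _mm_setzero_ps costs one xorps, and on compilers that provide it, _mm_undefined_ps expresses the same don't-care intent without the extra instruction:

    #include <xmmintrin.h>

    /* Load src[0..1] into the low half and src[4..5] into the high half. */
    static __m128 load_halves(const float *src) {
        __m128 tmp1 = _mm_setzero_ps();   /* or _mm_undefined_ps() where available */
        tmp1 = _mm_loadl_pi(tmp1, (const __m64 *)(src));
        tmp1 = _mm_loadh_pi(tmp1, (const __m64 *)(src + 4));
        return tmp1;
    }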

Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled

家住魔仙堡 submitted on 2019-12-02 01:35:48
I am trying to understand the benefit of SIMD vectorization and wrote a simple demonstrator to see the speed gain of an algorithm leveraging vectorization (SIMD) over another. Here are the 2 algorithms: Alg_A - no vector support: #include <stdio.h> #define SIZE 1000000000 int main() { printf("Algorithm with NO vector support\n"); int a[] = {1, 2, 3, 4}; int b[] = {5, 6, 7, 8}; int i = 0; printf("Running loop %d times\n", SIZE); for (; i < SIZE; i++) { a[0] = a[0] + b[0]; a[1] = a[1] + b[1]; a[2] = a[2] + b[2]; a[3] = a[3] + b[3]; } printf("A: [%d %d %d %d]\n", a[0], a
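
The vectorized counterpart is truncated in this excerpt, so the following is only a plausible reconstruction of Alg_B using SSE2 intrinsics, not the question's exact code; the point is that one _mm_add_epi32 performs all four additions per iteration:

    #include <stdio.h>
    #include <emmintrin.h>
    #define SIZE 1000000000

    int main() {
        printf("Algorithm WITH vector support\n");
        __m128i a = _mm_set_epi32(4, 3, 2, 1);       /* lanes {1, 2, 3, 4} */
        const __m128i b = _mm_set_epi32(8, 7, 6, 5); /* lanes {5, 6, 7, 8} */
        int i = 0;
        printf("Running loop %d times\n", SIZE);
        for (; i < SIZE; i++)
            a = _mm_add_epi32(a, b);                 /* four 32-bit adds at once */
        int out[4];
        _mm_storeu_si128((__m128i *)out, a);
        printf("A: [%d %d %d %d]\n", out[0], out[1], out[2], out[3]);
        return 0;
    }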

Compare 16 byte strings with SSE

99封情书 submitted on 2019-12-02 00:14:24
I have 16-byte 'strings' (they may be shorter, but you may assume they are padded with zeros at the end), but you may not assume they are 16-byte aligned (at least not always). How do I write a routine that compares them (for equality) with SSE intrinsics? I found this code fragment that could be of help, but I'm not sure it is appropriate: register __m128i xmm0, xmm1; register unsigned int eax; xmm0 = _mm_load_epi128((__m128i*)(a)); xmm1 = _mm_load_epi128((__m128i*)(b)); xmm0 = _mm_cmpeq_epi8(xmm0, xmm1); eax = _mm_movemask_epi8(xmm0); if(eax==0xffff) //equal else //not equal Could
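
The fragment is close, but _mm_load_epi128 is not a real intrinsic, and since the data may be unaligned the load should be the SSE2 _mm_loadu_si128. A corrected sketch (the function name equal16 is illustrative):

    #include <emmintrin.h>

    /* Returns nonzero when all 16 bytes of a and b match; no alignment required. */
    static int equal16(const void *a, const void *b) {
        __m128i xa = _mm_loadu_si128((const __m128i *)a);
        __m128i xb = _mm_loadu_si128((const __m128i *)b);
        __m128i eq = _mm_cmpeq_epi8(xa, xb);    /* 0xFF per matching byte lane */
        return _mm_movemask_epi8(eq) == 0xFFFF; /* all 16 lanes equal */
    }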

gcc 4.8 AVX optimization bug: extra code insertion?

末鹿安然 submitted on 2019-12-02 00:06:56
It is great that the gcc 4.8 compiler comes with AVX optimization under the -Ofast option. However, I found an interesting but stupid-looking bug: it adds additional computations that are unnecessary. Maybe I am wrong, so can someone give me an explanation? The original C++ source code is as follows: #define N 1000007 float a[N],b[N],c[N],d[N],e[N]; int main(int argc, char *argv[]){ cout << a << ' ' << b << ' ' << c << endl; for(int x=0; x<N; ++x){ c[x] = 1/sqrt((a[x]+b[x]-c[x])*d[x]/e[x]); } return 0; } The code is compiled using g++ 4.8.4 on Ubuntu 14.04.3 x86_64: g++ -mavx avx.cpp -masm=intel -c -g -Wa
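
For reference, a self-contained version of the snippet (the excerpt omits the includes). As a hedged guess at the "extra" code: under -Ofast GCC may replace 1/sqrt with the approximate vrsqrtps plus a Newton-Raphson refinement step, which looks like added computation but is the fast-math substitution; that is an inference, not something shown in this excerpt:

    // compile with e.g.: g++ -Ofast -mavx avx.cpp -masm=intel -c -g
    #include <iostream>
    #include <cmath>
    using namespace std;

    #define N 1000007
    float a[N], b[N], c[N], d[N], e[N];

    int main(int argc, char *argv[]) {
        cout << a << ' ' << b << ' ' << c << endl;  // keep the arrays observable
        for (int x = 0; x < N; ++x)
            c[x] = 1 / sqrt((a[x] + b[x] - c[x]) * d[x] / e[x]);
        return 0;
    }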

_mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument

◇◆丶佛笑我妖孽 submitted on 2019-12-01 22:45:51
Question: I've lately been using the SSE intrinsic int _mm_extract_epi8 (__m128i src, const int ndx) which, according to the reference, "extracts an integer byte from a packed integer array element selected by index". This is exactly what I want. However, I determine the index via _mm_cmpestri on a __m128i, which performs a packed comparison of string data with explicit lengths and generates the index. The range of this index is 0..16, where 0..15 represents a valid index and 16 means that no index was
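
Since _mm_extract_epi8 requires a compile-time-constant index, a common workaround for a runtime index is to spill the vector to memory and index the bytes. A sketch (the helper name and the -1 sentinel are illustrative, not from the question):

    #include <emmintrin.h>
    #include <stdint.h>

    /* Extract byte ndx (0..15) from v; returns -1 for ndx == 16 ("not found"). */
    static int extract_byte(__m128i v, int ndx) {
        if (ndx >= 16) return -1;
        uint8_t buf[16];
        _mm_storeu_si128((__m128i *)buf, v);
        return buf[ndx];
    }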

New AVX instruction syntax

落花浮王杯 submitted on 2019-12-01 22:23:47
Question: I had C code written with some Intel intrinsics. After I compiled it, first with the AVX and then with the SSSE3 flags, I got two quite different assembly listings. E.g.: AVX: vpunpckhbw %xmm0, %xmm1, %xmm2 SSSE3: movdqa %xmm0, %xmm2 punpckhbw %xmm1, %xmm2 It's clear that vpunpckhbw is just punpckhbw using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the latency and throughput of the last ones combined? Or does the answer depend on
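
For context, both listings can come from a single intrinsic; only the encoding changes with the target flags. A sketch, compiled once with -mavx and once with -mssse3 to produce the two forms:

    #include <emmintrin.h>

    /* Under SSE this compiles to punpckhbw (with a movdqa when the source must
       be preserved); under AVX, to the non-destructive vpunpckhbw. */
    __m128i unpack_high_bytes(__m128i a, __m128i b) {
        return _mm_unpackhi_epi8(a, b);
    }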