SSE

Is it safe to compile one source file with SSE2 and another with AVX architecture?

Question: I'm using AVX intrinsics, but since MSVC generates non-VEX instructions for everything other than _mm256-based intrinsics, I need to compile the whole source file with /arch:AVX. The rest of the project is compiled with /arch:SSE2 so that it works on older CPUs, and I check manually whether AVX is available. The source containing the AVX code (compiled for AVX) includes a huge library of templates and other stuff, just to have the definitions. Is there a possibility that the compiler/linker…
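A common pattern for this setup (a minimal sketch of my own, not from the question; kernel_avx and kernel_sse2 are hypothetical names) is to keep the AVX code in its own translation unit compiled with /arch:AVX and select it at runtime, so the dispatcher itself never executes VEX-encoded instructions:

    #include <intrin.h>   /* __cpuid, _xgetbv (MSVC) */
    #include <stddef.h>

    /* Defined in a separate source file compiled with /arch:AVX. */
    void kernel_avx(float* dst, const float* src, size_t n);
    /* Defined in a source file compiled with /arch:SSE2. */
    void kernel_sse2(float* dst, const float* src, size_t n);

    static int cpu_has_avx(void) {
        int info[4];
        __cpuid(info, 1);
        int osxsave = (info[2] & (1 << 27)) != 0;  /* OS saves extended state */
        int avx     = (info[2] & (1 << 28)) != 0;  /* CPU supports AVX */
        if (!(osxsave && avx)) return 0;
        /* Verify the OS actually enables both XMM and YMM state. */
        return (_xgetbv(0) & 0x6) == 0x6;
    }

    void kernel(float* dst, const float* src, size_t n) {
        if (cpu_has_avx()) kernel_avx(dst, src, n);
        else               kernel_sse2(dst, src, n);
    }

As long as only the AVX translation unit is built with /arch:AVX, the rest of the binary remains safe on SSE2-only CPUs.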

SSE memory access

Question: I need to perform Gaussian elimination using SSE and I am not sure how to access each element (32 bits) of the 128-bit registers (each storing 4 elements). This is the original code (without SSE):

    unsigned int i, j, k;
    for (i = 0; i < num_elements; i++)
        /* Copy the contents of the A matrix into the U matrix. */
        for (j = 0; j < num_elements; j++)
            U[num_elements * i + j] = A[num_elements * i + j];
    for (k = 0; k < num_elements; k++) {
        /* Perform Gaussian elimination in place on the U matrix. */…
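For the elimination step you generally don't extract individual lanes at all: you load four floats, operate on all of them, and store four back. A minimal sketch of the row update U[i][j] -= factor * U[k][j] with SSE intrinsics (my own illustration, assuming float data; the function and argument names are not from the question):

    #include <xmmintrin.h>  /* SSE */

    void eliminate_row(float* Ui, const float* Uk, float factor, unsigned n) {
        __m128 f = _mm_set1_ps(factor);        /* broadcast factor to all 4 lanes */
        unsigned j = 0;
        for (; j + 4 <= n; j += 4) {
            __m128 uk = _mm_loadu_ps(Uk + j);  /* 4 elements of the pivot row   */
            __m128 ui = _mm_loadu_ps(Ui + j);  /* 4 elements of the current row */
            ui = _mm_sub_ps(ui, _mm_mul_ps(f, uk));
            _mm_storeu_ps(Ui + j, ui);
        }
        for (; j < n; ++j)                     /* scalar tail for n % 4 leftovers */
            Ui[j] -= factor * Uk[j];
    }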

Can't find materials about SSE2, AltiVec, VMX on Apple Developer

Question: As Paul R suggested, there are plenty of resources about SSE2 and AVX on Apple Developer, but I couldn't find them. Could anyone help me? BTW, I am also looking for the archive of the AltiVec mailing list. Thanks! Intel SSE and AVX Examples and Tutorials. Source: https://stackoverflow.com/questions/22978362/cant-find-materials-about-sse2-altivec-vmx-on-apple-developer

__int128 alignment segfault with gcc -O SSE optimization

Question: I use __int128 as a struct member. It works fine with -O0 (no optimization). However, it crashes with a segmentation fault when optimization is enabled (-O1). It crashes at a movdqa instruction, which requires its operand to be 16-byte aligned, while the address was allocated by malloc(), which aligns only to 8 bytes. I tried to disable SSE optimization with -mno-sse, but then it fails to compile:

    /usr/include/x86_64-linux-gnu/bits/stdlib-float.h:27:1: error: SSE register return with SSE disabled

So what can I do if I want to use _…
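The usual fix (not shown in the truncated question; a minimal sketch assuming C11's aligned_alloc, though posix_memalign works the same way) is to give the struct an allocation that honors its 16-byte alignment requirement instead of using plain malloc():

    #include <stdlib.h>

    struct Item {
        __int128 value;   /* gcc requires 16-byte alignment for SSE access */
        long     other;
    };

    int main(void) {
        /* aligned_alloc requires size to be a multiple of the alignment */
        struct Item* p = aligned_alloc(16, sizeof(struct Item));
        if (!p) return 1;
        p->value = 42;    /* safe: movdqa now sees a 16-byte-aligned address */
        free(p);
        return 0;
    }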

Why can't I use _mm_sin_pd? [duplicate]

Question: This question already has answers here: C++ error: '_mm_sin_ps' was not declared in this scope (3 answers); How can I use SVML instructions [duplicate] (1 answer); Where is Clang's '_mm256_pow_ps' intrinsic? (1 answer). Closed 11 months ago. The specification says:

    __m128d _mm_sin_pd (__m128d a)
    #include <immintrin.h>
    CPUID Flags: SSE

Description: Compute the sine of packed double-precision (64-bit) floating-point elements in a, expressed in radians, and store the results in dst. But it seems it is not…
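As the linked duplicates explain, _mm_sin_pd is an SVML library function shipped with the Intel compiler, not a CPU instruction, which is why GCC and MSVC don't declare it. A portable stand-in (my own illustration, not from the answers) simply computes the two lanes with scalar sin():

    #include <emmintrin.h>  /* SSE2 */
    #include <math.h>

    /* Fallback for SVML's _mm_sin_pd: sine of both packed doubles. */
    static __m128d sin_pd_fallback(__m128d a) {
        double lanes[2];
        _mm_storeu_pd(lanes, a);
        lanes[0] = sin(lanes[0]);
        lanes[1] = sin(lanes[1]);
        return _mm_loadu_pd(lanes);
    }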

vector * matrix product efficiency issue

Question: Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose: (b * A)^T = A^T * b^T. Here A is a matrix and b a row vector, which, after being transposed, becomes a column vector. Its rows…
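With a column-major layout, the standard way to avoid horizontal dot products in a matrix-vector product (a sketch under my own naming, assuming a 4x4 float matrix) is to broadcast each component of the vector and accumulate scaled columns; via the transposition above, the same routine also covers b * A:

    #include <xmmintrin.h>  /* SSE */

    /* y = A * x for a column-major 4x4 float matrix A:
       y = x0*col0 + x1*col1 + x2*col2 + x3*col3 (no horizontal sums). */
    static __m128 mat4_mul_vec4(const float* A /* 16 floats, column-major */, __m128 x) {
        __m128 c0 = _mm_loadu_ps(A +  0);
        __m128 c1 = _mm_loadu_ps(A +  4);
        __m128 c2 = _mm_loadu_ps(A +  8);
        __m128 c3 = _mm_loadu_ps(A + 12);
        __m128 x0 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(0,0,0,0));  /* broadcast lane 0 */
        __m128 x1 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(1,1,1,1));
        __m128 x2 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2,2,2,2));
        __m128 x3 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3,3,3,3));
        __m128 y = _mm_mul_ps(x0, c0);
        y = _mm_add_ps(y, _mm_mul_ps(x1, c1));
        y = _mm_add_ps(y, _mm_mul_ps(x2, c2));
        y = _mm_add_ps(y, _mm_mul_ps(x3, c3));
        return y;
    }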

Efficiently Building Summed Area Table

Question: I am trying to construct a summed-area table for later use in an adaptive thresholding routine. Since this code is going to be used in time-critical software, I am trying to squeeze as many cycles as possible out of it. For performance, the table stores an unsigned integer for every pixel. When I attach my profiler, it shows that my largest performance bottleneck occurs when performing the x-pass. The simple math expression for the computation is:

    sat_[y * width + x] = sat_[y * width + x - 1] + …
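The x-pass is a running (prefix) sum along each row, which is inherently serial but can still be vectorized. A well-known SSE2 trick (a sketch with my own helper names, not the question's code) computes an in-register inclusive scan of 4 pixels and then adds the running carry from the previous group:

    #include <emmintrin.h>  /* SSE2 */

    /* In-register inclusive prefix sum of 4 uint32 lanes:
       (a, b, c, d) -> (a, a+b, a+b+c, a+b+c+d). */
    static __m128i scan4_epi32(__m128i x) {
        x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  /* shift in zeros by 1 lane  */
        x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  /* then by 2 lanes           */
        return x;
    }

    static void row_prefix_sum(unsigned* sat, const unsigned* pix, int width) {
        __m128i carry = _mm_setzero_si128();
        int x = 0;
        for (; x + 4 <= width; x += 4) {
            __m128i v = scan4_epi32(_mm_loadu_si128((const __m128i*)(pix + x)));
            v = _mm_add_epi32(v, carry);             /* add running total so far */
            _mm_storeu_si128((__m128i*)(sat + x), v);
            carry = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,3,3)); /* broadcast last lane */
        }
        for (; x < width; ++x) {                     /* scalar tail */
            unsigned prev = (x == 0) ? 0 : sat[x - 1];
            sat[x] = prev + pix[x];
        }
    }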

gcc (6.1.0) using 'wrong' instructions in SSE intrinsics

Question: Background: I develop a computationally intensive tool, written in C/C++, that has to run on a variety of different x86_64 processors. To speed up the calculations, which are both float and integer, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set Booleans, I've assumed that branch prediction for the tailored blocks of code will work very…
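One way to keep each tailored path compiled strictly for its own ISA level on GCC, without splitting files or raising the global -m flags, is the per-function target attribute (a sketch with illustrative function names; tail handling omitted):

    #include <immintrin.h>

    __attribute__((target("sse2")))
    void scale_sse2(float* p, float s, unsigned n) {
        __m128 f = _mm_set1_ps(s);
        for (unsigned i = 0; i + 4 <= n; i += 4)
            _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), f));
    }

    __attribute__((target("sse4.1")))
    void scale_sse41(float* p, float s, unsigned n) {
        /* same body, but gcc may emit SSE4.1 instructions here only */
        __m128 f = _mm_set1_ps(s);
        for (unsigned i = 0; i + 4 <= n; i += 4)
            _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), f));
    }

This confines any "newer than intended" instruction selection to the function that is guarded by the matching runtime Boolean.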

Extract 4 SSE integers to 4 chars

Question: Suppose I have a __m128i containing 4 32-bit integer values. Is there some way I can store it inside a char[4], where the lower char from each int value is stored in a char value? Desired result:

              r1          r2          r3          r4
    __m128i   0x00000012  0x00000034  0x00000056  0x00000078
                               |
                               V
    char[4]   0x12        0x34        0x56        0x78

SSE2 and below is preferred. Compiling on MSVC++.

Answer 1: With SSE2 you can use the following code:

    char array[4];
    x = _mm_packs_epi32(x, x);
    x = _mm_packus_epi16(x, x);
    *((int*)array) = _mm_cvtsi128_si32(x);
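For reference on why this works: _mm_packs_epi32 narrows the four 32-bit lanes to 16 bits with signed saturation, _mm_packus_epi16 narrows those to 8 bits with unsigned saturation, and the four result bytes end up in the low 32 bits of the register. A self-contained version of the answer (my own test harness around it, with illustrative values):

    #include <emmintrin.h>  /* SSE2 */
    #include <stdio.h>

    int main(void) {
        __m128i x = _mm_setr_epi32(0x12, 0x34, 0x56, 0x78);
        x = _mm_packs_epi32(x, x);    /* 32 -> 16 bit, signed saturation   */
        x = _mm_packus_epi16(x, x);   /* 16 -> 8 bit, unsigned saturation  */
        char array[4];
        *((int*)array) = _mm_cvtsi128_si32(x);
        printf("%02x %02x %02x %02x\n",
               (unsigned char)array[0], (unsigned char)array[1],
               (unsigned char)array[2], (unsigned char)array[3]);
        /* prints: 12 34 56 78 */
        return 0;
    }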

How can I improve performance compiling for SSE and AVX?

Question: My new PC has a Core i7 CPU and I am running my benchmarks, including newer versions that use AVX instructions. I have installed Visual Studio 2013 to get a newer compiler, as my last one could not compile for full SSE SIMD operation. Below is some code used in one of my benchmarks (MPMFLOPS), along with the compile and link commands used. The tests were run with the first command, to use SSE instructions. When xtra is 16 or less, the benchmark produces 24.4 GFLOPS. The CPU runs at 3.9 GHz, so the result is…
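For context on interpreting such a figure (my own arithmetic, not the question's truncated continuation, assuming one vector multiply plus one vector add issued per cycle on a pre-FMA core):

    24.4 GFLOPS / 3.9 GHz ≈ 6.3 FLOPs/cycle
    SSE peak:  4 lanes × (1 mul + 1 add) per cycle =  8 FLOPs/cycle → 31.2 GFLOPS at 3.9 GHz
    AVX peak:  8 lanes × (1 mul + 1 add) per cycle = 16 FLOPs/cycle → 62.4 GFLOPS at 3.9 GHz

So the SSE build already reaches roughly 78% of its theoretical per-core peak, and compiling with /arch:AVX doubles the vector width and hence the ceiling.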