sse

How do I declare a memory range as uncacheable using gcc on an x86 platform?

眉间皱痕 submitted on 2019-11-29 08:29:37
Question: Although I have read about the movntdqa instruction in this regard, I have not figured out a clean way to declare a memory range as uncacheable, or to read data without polluting the cache. I want to do this from gcc. My main goal is to swap two random locations in a large array. I am hoping to accelerate this operation by avoiding caching, since there is very little data reuse. Answer 1: I think what you're describing is Memory Type Range Registers. You can control these under Linux (if available and you're user

Fast memory transpose with SSE, AVX, and OpenMP

孤人 submitted on 2019-11-29 07:53:08
Question: I need a fast memory transpose algorithm for my Gaussian convolution function in C/C++. What I do now is: convolute_1D, transpose, convolute_1D, transpose. It turns out that with this method the filter size has to be large (or larger than I expected), otherwise the transpose takes longer than the convolution (e.g. for a 1920x1080 matrix the convolution takes the same time as the transpose for a filter size of 35). The current transpose algorithm I am using uses loop blocking/tiling along with SSE and
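The excerpt is cut off before the asker's own code, so the following is only a minimal sketch of what "loop blocking/tiling along with SSE" can look like, not the asker's implementation. It assumes a row-major float matrix whose dimensions are both multiples of 4; the function names `transpose4x4_tile` and `transpose_tiled` are hypothetical. The only library pieces used are the `_MM_TRANSPOSE4_PS` macro and the unaligned load/store intrinsics from `xmmintrin.h`.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */

/* Transpose one 4x4 tile of floats (hypothetical helper).
 * src and dst point at the tile's top-left element; strides are row
 * lengths in floats. Unaligned loads/stores are used so the sketch does
 * not depend on how the caller allocated the matrices. */
static void transpose4x4_tile(const float *src, size_t src_stride,
                              float *dst, size_t dst_stride)
{
    __m128 r0 = _mm_loadu_ps(src + 0 * src_stride);
    __m128 r1 = _mm_loadu_ps(src + 1 * src_stride);
    __m128 r2 = _mm_loadu_ps(src + 2 * src_stride);
    __m128 r3 = _mm_loadu_ps(src + 3 * src_stride);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* rows become columns, in registers */

    _mm_storeu_ps(dst + 0 * dst_stride, r0);
    _mm_storeu_ps(dst + 1 * dst_stride, r1);
    _mm_storeu_ps(dst + 2 * dst_stride, r2);
    _mm_storeu_ps(dst + 3 * dst_stride, r3);
}

/* Transpose a rows x cols row-major matrix into a cols x rows one.
 * Assumption: rows and cols are both multiples of 4. */
static void transpose_tiled(const float *src, float *dst,
                            size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i += 4)
        for (size_t j = 0; j < cols; j += 4)
            transpose4x4_tile(src + i * cols + j, cols,
                              dst + j * rows + i, rows);
}
```

For large images, a second level of blocking (for example, processing groups of 4x4 tiles that fit in cache together) is the usual next step, but the right block size depends on the hardware and would need measuring.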

What is the minimum supported SSE flag that can be enabled on macOS?

半世苍凉 submitted on 2019-11-29 07:37:25
Most of the hardware I use supports SSE2 these days. On Windows and Linux, I have some code to test SSE support. I read somewhere that macOS has supported SSE for a long time, but I don't know the minimum version that can be enabled. The final binary will be copied to other macOS platforms so I cannot use -march=native like with GCC. If it is enabled by default on all builds, do I have to pass -msse or -msse2 flags when building my code? Here is my compiler version:
Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix
Here is
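One way to see which SSE level a given build is allowed to use is to look at the macros GCC and Clang predefine from the target and the -m flags. The sketch below reports the highest level visible at compile time; the function name `compiled_sse_level` is hypothetical. On any x86-64 target SSE2 is part of the baseline, so at least `__SSE2__` should be defined there without passing extra flags.

```c
/* Report the highest SSE level the compiler was told it may use.
 * __SSE__, __SSE2__, ... are predefined by GCC and Clang according to
 * the target and the -m flags; nothing is probed at run time here. */
static const char *compiled_sse_level(void)
{
#if defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

The same information can be dumped without writing any code by running `clang -dM -E -x c /dev/null | grep SSE`, which lists the predefined SSE macros for the default target.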

How to allocate 16byte memory aligned data

久未见 submitted on 2019-11-29 07:35:05
Question: I am trying to implement SSE vectorization on a piece of code for which I need my 1D array to be 16-byte memory aligned. However, I have tried several ways to allocate 16-byte memory-aligned data, but it ends up being 4-byte memory aligned. I have to work with the Intel icc compiler. This is the sample code I am testing with:
#include <stdio.h>
#include <stdlib.h>
void error(char *str) { printf("Error:%s\n",str); exit(-1); }
int main() {
    int i;
    //float *A=NULL;
    float *A = (float*) memalign(16,20
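The asker's code is truncated, so this is not a fix for it but a minimal sketch of one common alternative: `_mm_malloc` and `_mm_free`, which GCC, Clang and ICC make available through `xmmintrin.h`. The helper names `alloc_floats_aligned16` and `is_aligned16` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; also declares _mm_malloc/_mm_free */

/* Allocate count floats on a 16-byte boundary (hypothetical helper).
 * Returns NULL on failure. The block must be released with _mm_free,
 * not free(). */
static float *alloc_floats_aligned16(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 16);
}

/* Check the low four address bits to confirm 16-byte alignment. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A pointer obtained this way can be passed to aligned intrinsics such as `_mm_load_ps` and `_mm_store_ps`. Printing the address with `%p`, or calling a check like `is_aligned16`, is a quick way to confirm what alignment an allocation actually has.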

SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers

二次信任 submitted on 2019-11-29 07:30:11
Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes. There is an intrinsic _mm_cvtps_pi8 that will convert 32-bit to 8-bit signed int, but the problem there is that any value over 127 gets clamped to 127. I can't find any instructions that will clamp to unsigned 8-bit values. I have an intuition that what I may want to do is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8 followed by a move instruction to get the four bytes I care about into memory. Is that the best way to
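As a point of comparison with the MMX-register intrinsics named in the question, here is a minimal sketch of a different route that stays in XMM registers and needs only SSE2: convert to 32-bit integers, then narrow twice with the saturating pack instructions. The function name `store_floats_as_u8` is hypothetical, and this is one possible approach rather than necessarily the best one.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_epi32, _mm_packs_epi32, _mm_packus_epi16 */
#include <stdint.h>
#include <string.h>

/* Convert four floats to four unsigned bytes with saturation to [0, 255]
 * (hypothetical helper). Rounding follows the current MXCSR mode, which
 * is round-to-nearest-even by default. */
static void store_floats_as_u8(__m128 v, uint8_t out[4])
{
    __m128i i32 = _mm_cvtps_epi32(v);         /* float -> int32 */
    __m128i i16 = _mm_packs_epi32(i32, i32);  /* int32 -> int16, signed saturation */
    __m128i u8  = _mm_packus_epi16(i16, i16); /* int16 -> uint8, unsigned saturation */
    int packed  = _mm_cvtsi128_si32(u8);      /* low 4 bytes hold the result */
    memcpy(out, &packed, sizeof packed);      /* avoids aliasing/alignment issues */
}
```

Because `_mm_packus_epi16` saturates to the unsigned range, values above 127 survive intact, and out-of-range inputs are clamped even if the earlier clamp is skipped.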

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

僤鯓⒐⒋嵵緔 submitted on 2019-11-29 07:12:42
Let's say I have a function written in c++ that performs matrix vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring a 16 byte alignment for SSE or 32 byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code and the data alignment will only affect cache performance? If alignment is important for the generated code, how can I
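The question is cut off at "how can I", so what follows is only a guess at the kind of answer it was heading toward: a minimal sketch of passing an alignment promise to the compiler with `__builtin_assume_aligned`, a GCC and Clang builtin. The function name `scale_aligned16` is hypothetical, and a simple scaling loop stands in for the matrix-vector multiply. Whether the loop is actually vectorized depends on the compiler and optimization flags.

```c
#include <stddef.h>

/* Scale n floats in place (hypothetical example function).
 * The caller promises that data is 16-byte aligned; breaking that
 * promise is undefined behaviour. */
static void scale_aligned16(float *data, size_t n, float factor)
{
    /* GCC/Clang builtin: returns the same pointer, annotated as aligned,
     * so the optimizer may pick aligned loads and skip peeling code. */
    float *p = __builtin_assume_aligned(data, 16);

    for (size_t i = 0; i < n; ++i)
        p[i] *= factor;
}
```

Without such a hint, compilers can still vectorize the loop by using unaligned loads or by emitting a run-time alignment check, so the hint mainly affects which instructions and prologue code are generated.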

pow for SSE types

走远了吗. submitted on 2019-11-29 07:11:18
I do some explicitly vectorised computations using SSE types, such as __m128 (defined in xmmintrin.h etc), but now I need to raise all elements of the vector to some (same) power, i.e. ideally I would want something like __m128 _mm_pow_ps(__m128, float) , which unfortunately doesn't exist. What is the best way around this? I could store the vector, call std::pow on each element, and then reload it. Is this the best I can do? How do compilers implement a call to std::pow when auto-vectorising code that otherwise is well vectorisable? Are there any libraries that provide something useful? (note

horizontal sum of 8 packed 32bit floats

别等时光非礼了梦想. submitted on 2019-11-29 05:16:32
If I have 8 packed 32-bit floating point numbers ( __m256 ), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how to obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions?
float sum(__m256 x); ///< returns sum of all 8 elements
float max(__m256 x); ///< returns the maximum of all 8 elements
float min(__m256 x); ///< returns the minimum of all 8 elements
Quickly jotted down here (and hence untested):
float sum(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 lo = _mm256_extractf128
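The quoted answer is truncated after it splits the vector into two 128-bit halves. Adding those halves leaves a single __m128 to reduce, and the sketch below shows one common SSE-only pattern for that last step. It is not necessarily how the original answer continues, and the function name `hsum_ps` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _mm_movehl_ps, _mm_add_ss */

/* Horizontal sum of the four floats in v (hypothetical helper). */
static float hsum_ps(__m128 v)
{
    /* Swap neighbours: shuf = [v1, v0, v3, v2] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = [v0+v1, v1+v0, v2+v3, v3+v2] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Move the upper pair of sums into the low lanes: shuf[0] = v2+v3 */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 now becomes (v0+v1) + (v2+v3) */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

Replacing the add intrinsics with `_mm_max_ps`/`_mm_max_ss` or `_mm_min_ps`/`_mm_min_ss` gives the horizontal maximum or minimum with the same shuffle sequence.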

SSE2 intrinsics - comparing unsigned integers

删除回忆录丶 submitted on 2019-11-29 05:15:53
I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and saturating the result to 0xFF:
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m3 = _mm_adds_epu8(m1, m2);
I would be interested in performing a less-than comparison on these unsigned integers, similar to _mm_cmplt_epi8 for signed:
__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);
If an "epu8" equivalent was available, mask would have 0xFF where m3[i] < m1[i] (overflow!), 0x00 otherwise, and we
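SSE2 has no unsigned byte comparison, but one well-known workaround is to flip the top bit of both operands and then use the signed compare, since XOR with 0x80 maps the unsigned range 0..255 onto the signed range -128..127 while preserving order. A minimal sketch follows; the function name `cmplt_epu8` is hypothetical and not a real intrinsic.

```c
#include <emmintrin.h>  /* SSE2: _mm_xor_si128, _mm_cmplt_epi8, _mm_set1_epi8 */

/* Per-byte unsigned a < b: 0xFF where true, 0x00 where false
 * (hypothetical helper, not an Intel intrinsic). */
static __m128i cmplt_epu8(__m128i a, __m128i b)
{
    /* Flipping the sign bit turns unsigned order into signed order. */
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmplt_epi8(_mm_xor_si128(a, bias),
                          _mm_xor_si128(b, bias));
}
```

An alternative that avoids the bias constant is `_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)`, which yields the mask for a >= b; inverting it gives a < b.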

Compilation of a simple c++ program using SSE intrinsics

核能气质少年 submitted on 2019-11-29 04:46:14
I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU. Here is some code based on the article which I attempted. For two arrays of length ARRAY_SIZE it calculates fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5 Here is the code:
#include <iostream>
#include <iomanip>
#include <ctime>
#include <stdlib.h>
#include <xmmintrin.h> // Contain the SSE compiler intrinsics
#include <malloc.h>
void
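The asker's listing is cut off right after the includes, so the following is not a reconstruction of it. It is a minimal C sketch of the stated formula, sqrt(a*a + b*b) + 0.5, processed four floats at a time with SSE intrinsics. The function name `hypot_plus_half` and its parameter names are hypothetical, and unaligned loads are used so the sketch makes no assumption about how the arrays were allocated.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps */

/* out[i] = sqrt(a[i]*a[i] + b[i]*b[i]) + 0.5 for i in [0, n)
 * (hypothetical helper). */
static void hypot_plus_half(const float *a, const float *b,
                            float *out, size_t n)
{
    const __m128 half = _mm_set1_ps(0.5f);
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);
        __m128 y = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y));
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    /* Tail: remaining 0..3 elements, using the scalar SSE square root. */
    for (; i < n; ++i) {
        __m128 s = _mm_set_ss(a[i] * a[i] + b[i] * b[i]);
        out[i] = _mm_cvtss_f32(_mm_sqrt_ss(s)) + 0.5f;
    }
}
```

With GCC on x86-64 this compiles without extra flags because SSE is part of the baseline; on 32-bit x86 the `-msse` flag is needed for `xmmintrin.h` intrinsics to be usable.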