sse | 易学教程

NEON, SSE and interleaving loads vs shuffles

Question: I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics: "... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available." The trouble I am having is the solution

Using SSE in C#

Question: I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code causes 90-95% of the execution time. The code itself is also perfect for SSE (as it's matrix- and vector-based), so I went ahead and started to use Mono.Simd, and even though this made a significant difference in execution time, it still isn't enough. The problem with Mono.Simd is that it only has very old SSE instructions (mainly from SSE1 and SSE2, I believe), which

Can counting byte matches between two strings be optimized using SIMD?

Question: Profiling suggests that this function here is a real bottleneck for my application:

    static inline int countEqualChars(const char* string1, const char* string2, int size) {
        int r = 0;
        for (int j = 0; j < size; ++j) {
            if (string1[j] == string2[j]) {
                ++r;
            }
        }
        return r;
    }

Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be
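The idea of comparing more than one character at once can be sketched with SSE2 intrinsics. This is only an illustration, not the answer given to the original question: the function name count_equal_bytes is hypothetical, and the code assumes a compiler that defines __SSE2__ when SSE2 is enabled (it falls back to the scalar loop otherwise).

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: counts positions i < n where a[i] == b[i]. */
size_t count_equal_bytes(const char *a, const char *b, size_t n)
{
    size_t i = 0, total = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    while (n - i >= 16) {
        __m128i acc = _mm_setzero_si128();
        /* At most 255 blocks per round so the 8-bit counters cannot overflow. */
        size_t blocks = (n - i) / 16;
        if (blocks > 255)
            blocks = 255;
        for (size_t k = 0; k < blocks; ++k, i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            /* cmpeq gives 0xFF (-1) per matching byte; subtracting it adds 1. */
            acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(va, vb));
        }
        /* psadbw against zero sums each group of 8 byte counters. */
        __m128i sums = _mm_sad_epu8(acc, zero);
        total += (size_t)_mm_cvtsi128_si32(sums)
               + (size_t)_mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
    }
#endif
    /* Scalar tail (or the whole input when SSE2 is unavailable). */
    for (; i < n; ++i)
        total += (a[i] == b[i]);
    return total;
}
```

Unaligned loads are used so the caller does not need to guarantee 16-byte alignment; whether this beats the compiler's own output depends on the compiler version and the input sizes.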

Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

Question: I'm developing optimizations for my 3D calculations and I now have:

- a "plain" version using the standard C language libraries,
- an SSE-optimized version that compiles using a preprocessor #define USE_SSE,
- an AVX-optimized version that compiles using a preprocessor #define USE_AVX.

Is it possible to switch between the 3 versions without having to compile different executables (ex. having different library files and loading the "right" one dynamically, don't know if inline functions are "right
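One common way to keep all three versions in a single executable is to select a function pointer at run time. The sketch below is an assumption about how that could look, not the answer from the original thread: it relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports and the target function attribute, and the names sum_plain, sum_sse, sum_avx and vec_sum are hypothetical stand-ins for the real 3D kernels.

```c
#include <stddef.h>

/* Dispatch is only attempted on GCC/Clang targeting x86 (assumption). */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

typedef float (*sum_fn)(const float *, size_t);

/* "Plain" version: portable C. */
static float sum_plain(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

#if HAVE_X86_DISPATCH
/* Stand-ins for the hand-written kernels: the target attribute lets the
   compiler emit SSE2 / AVX code for these functions only. */
__attribute__((target("sse2")))
static float sum_sse(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

__attribute__((target("avx")))
static float sum_avx(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}
#endif

/* Pick the best version the running CPU supports. */
static sum_fn pick_sum(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return sum_avx;
    if (__builtin_cpu_supports("sse2"))
        return sum_sse;
#endif
    return sum_plain;
}

/* Public entry point: resolves the implementation on first use.
   Not thread-safe on the first call; resolve it at startup if that matters. */
float vec_sum(const float *v, size_t n)
{
    static sum_fn impl = NULL;
    if (!impl)
        impl = pick_sum();
    return impl(v, n);
}
```

Callers only ever see vec_sum, so the choice of instruction set stays an internal detail. The indirect call prevents inlining, which is why dispatch is usually done at the level of whole loops rather than small inline helpers.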

Testing equality between two __m128i variables

Question: If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use == ? If not, which SSE instruction should I use?

Answer 1: Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction. To do this you could do this: if (_mm_test_all_ones(_mm_cmpeq_epi8(v1, v2))) { // v1 == v2
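For comparison, the _mm_movemask_epi8 route mentioned at the start of the answer needs only SSE2. The following is a minimal sketch under that assumption; the helper names m128i_equal and blocks16_equal are hypothetical, and the memcmp branch exists only so the code still compiles where SSE2 is unavailable.

```c
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* SSE2 only: compare all 16 bytes, then collect the 16 per-byte results
   into a bit mask. All bits set (0xFFFF) means every byte matched. */
static int m128i_equal(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
#endif

/* Hypothetical wrapper: tests two 16-byte blocks in memory for equality. */
int blocks16_equal(const void *p, const void *q)
{
#if defined(__SSE2__)
    return m128i_equal(_mm_loadu_si128((const __m128i *)p),
                       _mm_loadu_si128((const __m128i *)q));
#else
    return memcmp(p, q, 16) == 0;
#endif
}
```

The SSE4.1 variant in the answer replaces the movemask-and-compare pair with a single flag-setting test, at the cost of requiring newer hardware.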

Why don't GCC and Clang use cvtss2sd [memory]?

Question: I'm trying to optimize some code that's supposed to read single precision floats from memory and perform arithmetic on them in double precision. This is becoming a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue: #include <cstdio> // noinline to force main() to actually read the value from

Can one construct a “good” hash function using CRC32C as a base?

Question: Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?

Update: How about this? Only 16 bits are suitable for a hash value. Fine. If your table is 65535 or less then great. If not, run the CRC value through the Nehalem POPCNT (population count)

How are denormalized floats handled in C#?

Question: Just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating point numbers very close to 0). There is an option with SSE to round these off to 0, restoring performance when such floating point values are encountered. How do C# apps handle this? Is there an option to enable/disable _MM_FLUSH_ZERO ?

Answer 1: There is no such option. The FPU control word in a C# app is initialized by the CLR at startup. Changing it is not an option
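For reference, this is roughly what the SSE flush-to-zero option looks like from native C, where the control register is accessible. It is a sketch of the native mechanism only, not something a C# application can call directly: the function name denormal_result_is_zero is hypothetical, and the effect is assumed to apply only where float arithmetic is done with SSE instructions (the default on x86-64).

```c
#include <float.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical probe: multiplies the smallest normal float by 0.5, which
   produces a denormal, and reports whether the result came back as zero.
   enable_ftz selects the SSE flush-to-zero mode for the computation. */
int denormal_result_is_zero(int enable_ftz)
{
#if defined(__SSE__)
    unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(enable_ftz ? _MM_FLUSH_ZERO_ON
                                       : _MM_FLUSH_ZERO_OFF);
#else
    (void)enable_ftz; /* no SSE control register on this target */
#endif
    /* volatile keeps the compiler from folding the product at build time. */
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
    volatile float r = tiny * half;
    int is_zero = (r == 0.0f);
#if defined(__SSE__)
    _MM_SET_FLUSH_ZERO_MODE(saved); /* restore the caller's mode */
#endif
    return is_zero;
}
```

The mode lives in a per-thread control register, so native code that changes it should restore the previous value, as done here.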