SSE

NEON, SSE and interleaving loads vs shuffles

Posted by 独自空忆成欢 on 2019-12-19 04:55:14
Question: I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics: "... why don't you use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available." The trouble I am having is the solution …
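
For reference, a minimal sketch of what the comment is pointing at: NEON's structure loads deinterleave in a single instruction, so no shuffles are needed. The function name and the planar-output layout here are illustrative, not taken from the linked question.

    // Sketch only: deinterleaving 16 BGR pixels with one NEON structure load.
    // vld3q_u8 fills three registers with the B, G and R planes directly, so
    // no shuffle instructions are needed (unlike SSE, which lacks such a load).
    #include <arm_neon.h>
    #include <stdint.h>

    void split_bgr_neon(const uint8_t* bgr, uint8_t* b, uint8_t* g, uint8_t* r)
    {
        uint8x16x3_t pix = vld3q_u8(bgr);   // loads 48 bytes, deinterleaved
        vst1q_u8(b, pix.val[0]);            // all B components
        vst1q_u8(g, pix.val[1]);            // all G components
        vst1q_u8(r, pix.val[2]);            // all R components
    }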

Using SSE in C#

Posted by 喜你入骨 on 2019-12-19 03:44:07
Question: I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code causes 90-95% of the execution time. The code itself is also perfect for SSE (it's matrix- and vector-based), so I went ahead and started to use Mono.Simd, and even though this made a significant difference in execution time, it still isn't enough. The problem with Mono.Simd is that it only has very old SSE instructions (mainly from SSE1 and SSE2, I believe), which …
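
One common workaround (not necessarily the answer given in the thread) is to move the hot loop into a tiny native library and call it from C# via P/Invoke, so newer instruction sets than Mono.Simd exposes become available. A sketch of the native side, with an illustrative function name and a 4-float dot product standing in for the real matrix/vector kernel:

    // Sketch of a native helper using an SSE4.1 instruction (dpps) that
    // Mono.Simd does not expose; it could be called from C# via P/Invoke.
    #include <smmintrin.h>   // SSE4.1

    extern "C" float dot4(const float* a, const float* b)
    {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        // _mm_dp_ps multiplies all four lanes (mask 0xF1) and sums them into lane 0.
        return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
    }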

Can counting byte matches between two strings be optimized using SIMD?

Posted by 随声附和 on 2019-12-19 00:38:52
Question: Profiling suggests that this function is a real bottleneck for my application:

    static inline int countEqualChars(const char* string1, const char* string2, int size) {
        int r = 0;
        for (int j = 0; j < size; ++j) {
            if (string1[j] == string2[j]) {
                ++r;
            }
        }
        return r;
    }

Even with -O3 and -march=native, G++ 4.7.2 does not vectorize this function (I checked the assembler output). Now, I'm not an expert with SSE and friends, but I think that comparing more than one character at once should be …
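
A minimal SSE2 sketch of the idea (not the thread's accepted answer verbatim): compare 16 bytes at a time with _mm_cmpeq_epi8, collapse the result to a bitmask with _mm_movemask_epi8, and popcount it; the tail is handled by the original scalar loop. __builtin_popcount is GCC/Clang-specific.

    #include <emmintrin.h>  // SSE2

    static inline int countEqualChars_sse2(const char* s1, const char* s2, int size)
    {
        int r = 0, j = 0;
        for (; j + 16 <= size; j += 16) {
            __m128i a = _mm_loadu_si128((const __m128i*)(s1 + j));
            __m128i b = _mm_loadu_si128((const __m128i*)(s2 + j));
            // 0xFF in every lane where the bytes match, 0x00 otherwise.
            __m128i eq = _mm_cmpeq_epi8(a, b);
            // One bit per lane; popcount gives the number of matches in this block.
            r += __builtin_popcount((unsigned)_mm_movemask_epi8(eq));
        }
        for (; j < size; ++j)       // scalar tail for the last size % 16 bytes
            r += (s1[j] == s2[j]);
        return r;
    }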

Have different optimizations (plain, SSE, AVX) in the same executable with C/C++

Posted by 萝らか妹 on 2019-12-18 16:12:08
Question: I'm developing optimizations for my 3D calculations and I now have: a "plain" version using the standard C language libraries, an SSE-optimized version compiled with a preprocessor #define USE_SSE, and an AVX-optimized version compiled with a preprocessor #define USE_AVX. Is it possible to switch between the 3 versions without having to compile different executables (e.g. having different library files and loading the "right" one dynamically; I don't know if inline functions are "right …
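
One common pattern (a sketch, not necessarily the thread's answer): compile each variant, pick one at program start through a function pointer, and dispatch on what the CPU reports. The bodies below are placeholders; in the setup described in the question the real SSE/AVX implementations would live in their own translation units built with -msse2 / -mavx (or the existing USE_SSE / USE_AVX defines). __builtin_cpu_supports is a GCC/Clang builtin; other compilers need cpuid directly.

    // Runtime-dispatch sketch: choose one of three variants once, at startup.
    static void scale_plain(float* v, int n) { for (int i = 0; i < n; ++i) v[i] *= 2.0f; }
    static void scale_sse  (float* v, int n) { scale_plain(v, n); /* placeholder body */ }
    static void scale_avx  (float* v, int n) { scale_plain(v, n); /* placeholder body */ }

    typedef void (*scale_fn)(float*, int);

    static scale_fn pick_scale()
    {
        if (__builtin_cpu_supports("avx"))  return scale_avx;
        if (__builtin_cpu_supports("sse2")) return scale_sse;
        return scale_plain;
    }

    int main()
    {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        scale_fn scale = pick_scale();   // decided once, per the host CPU
        scale(data, 8);
    }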

Testing equality between two __m128i variables

Posted by ↘锁芯ラ on 2019-12-18 13:13:04
Question: If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If not, which SSE instruction should I use? Answer 1: Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction which sets the zero or carry flag in the FLAGS register, which saves a test or cmp instruction. You could do this: if (_mm_test_all_ones(_mm_cmpeq_epi8(v1, v2))) { // v1 == v2 …
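
For completeness, a compilable sketch of both variants mentioned in the answer: the SSE2 movemask test and the SSE4.1 test that folds the compare result straight into the flags (compile with -msse4.1 or -march=native for the second).

    #include <smmintrin.h>   // SSE4.1; pulls in the SSE2 headers too

    static inline bool equal_sse2(__m128i v1, __m128i v2)
    {
        // movemask collects the top bit of each byte; 0xFFFF means all 16 bytes matched.
        return _mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF;
    }

    static inline bool equal_sse41(__m128i v1, __m128i v2)
    {
        // PTEST-based: true only when the compare result is all ones.
        return _mm_test_all_ones(_mm_cmpeq_epi8(v1, v2));
    }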

Why don't GCC and Clang use cvtss2sd [memory]?

Posted by 一笑奈何 on 2019-12-18 13:12:52
Question: I'm trying to optimize some code that's supposed to read single-precision floats from memory and perform arithmetic on them in double precision. This has become a significant performance bottleneck, as the code that stores data in memory as single precision is substantially slower than equivalent code that stores data in memory as double precision. Below is a toy C++ program that captures the essence of my issue: #include <cstdio> // noinline to force main() to actually read the value from …
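
The toy program is cut off above; a minimal stand-in for the pattern it describes (floats in memory, arithmetic in double) might look like the sketch below. The function name is made up. Whether the compiler emits a single cvtss2sd with a memory operand or a separate movss load followed by a register-to-register cvtss2sd is exactly what the question is asking about.

    // Float loaded from memory, widened to double before the addition.
    double accumulate(const float* data, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += data[i];   // implicit float -> double conversion per element
        return sum;
    }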

Can one construct a “good” hash function using CRC32C as a base?

Posted by 谁说我不能喝 on 2019-12-18 12:46:11
Question: Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that? Update: How about this? Only 16 bits are suitable for a hash value. Fine. If your table has 65535 entries or fewer, then great. If not, run the CRC value through the Nehalem POPCNT (population count) …
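
A hedged sketch of using the SSE4.2 CRC32C intrinsics as a hash building block, 8 bytes at a time, with a simple multiplicative mix at the end. The final mix constant is illustrative only; this is not a vetted hash construction, just the shape of the idea being discussed (compile with -msse4.2).

    #include <nmmintrin.h>   // SSE4.2: _mm_crc32_u64 / _mm_crc32_u8
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    uint64_t crc_hash(const void* data, size_t len, uint64_t seed = 0)
    {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        uint64_t h = seed;
        for (; len >= 8; len -= 8, p += 8) {
            uint64_t word;
            std::memcpy(&word, p, 8);            // unaligned-safe load
            h = _mm_crc32_u64(h, word);
        }
        for (; len; --len, ++p)                  // remaining 0-7 bytes
            h = _mm_crc32_u8(static_cast<uint32_t>(h), *p);
        // Spread the 32 bits of CRC state across 64 bits with a multiplicative mix.
        return h * 0x9E3779B97F4A7C15ULL;
    }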

How are denormalized floats handled in C#?

Posted by 谁说我不能喝 on 2019-12-18 12:43:38
Question: Just read this fascinating article about the 20x-200x slowdowns you can get on Intel CPUs with denormalized floats (floating-point numbers very close to 0). There is an option with SSE to round these off to 0, restoring performance when such floating-point values are encountered. How do C# apps handle this? Is there an option to enable/disable _MM_FLUSH_ZERO? Answer 1: There is no such option. The FPU control word in a C# app is initialized by the CLR at startup, and changing it is not an option …
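
For reference, this is what the _MM_FLUSH_ZERO option looks like in native C/C++ (a sketch; as the answer notes, managed C# code has no supported way to change this state):

    // Set the FTZ and DAZ bits of the SSE control register (MXCSR) so that
    // denormal results are flushed to zero and denormal inputs are treated as zero.
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3 header)

    void enable_ftz_daz()
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // results: flush to zero
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // inputs: treat as zero
    }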