SSE

SSE-optimized code performs similarly to plain version

Submitted by *爱你&永不变心* on 2019-12-07 22:26:53
Question: I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows in C++ I am targeting Linux and C (therefore I use posix_memalign instead of _aligned_malloc). I also implemented one compute-intensive method without making use of the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, usually being the time…

Array Error - Access violation reading location 0xffffffff

Submitted by 和自甴很熟 on 2019-12-07 18:04:49
Question: I have previously used SIMD operators to improve the efficiency of my code; however, I am now facing a new error which I cannot resolve. For this task, speed is paramount. The size of the array will not be known until the data is imported, and it may be very small (100 values) or enormous (10 million values). For the latter case the code works fine, but I encounter an error when I use fewer than 130036 array values. Does anyone know what is causing this issue and how to resolve it? I…

SSE: How to reduce a _m128i._i32[4] to _m128i._i8

Submitted by 孤人 on 2019-12-07 17:45:33
Question: I'm very new to SSE coding, and I want to store the result of four _m128i values with int32 type into one _m128i with int8 type. (The values of _m128i[j]._i32[k] are all between -127 and +127.) I think in pseudo-code this is the following: result._i8 = { vec1._i8[0], vec1._i8[4], vec1._i8[8], vec1._i8[12], vec2._i8[0], vec2._i8[4], vec2._i8[8], vec2._i8[12], vec3._i8[0], vec3._i8[4], vec3._i8[8], vec3._i8[12], vec4._i8[0], vec4._i8[4], vec4._i8[8], vec4._i8[12] }. The only way I found is this messy…

x86 CPU Dispatching for SSE/AVX in C++

Submitted by 青春壹個敷衍的年華 on 2019-12-07 17:29:29
Question: I have an algorithm which benefits from hand optimisation with SSE(2) intrinsics. Moreover, the algorithm will also be able to benefit from the 256-bit AVX registers in the future. My question is: what is the best way to register the available variants of my class at compile time? So if my classes are, say, Foo, FooSSE2 and FooAVX, I require a means of determining at runtime which classes were compiled in, and of determining the capabilities of the current CPU. At the lowest level this will result in…

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

Submitted by [亡魂溺海] on 2019-12-07 16:37:52
Question: I would like to know if the following is possible in any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential group of 7 bits, starting from the LSB, is shuffle-aligned to a byte, with a left padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result contains only the significant bytes, thus 0-9 in size, which is converted to a byte array. In: 0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ…

Alignment of vectors in LLVM's amd64 output

Submitted by ⅰ亾dé卋堺 on 2019-12-07 16:29:34
Question: I'm trying to use vectors inside structs with LLVM. I have the following C definition of my struct: struct Foo { uint32_t len; uint32_t data[32] __attribute__ ((aligned (16))); }; and here is some LLVM code to add 42 to element number 3 of the data field: %Foo = type { i32, <32 x i32> } define void @process(%Foo*) { _L1: %data = getelementptr %Foo* %0, i32 0, i32 1 %vec = load <32 x i32>* %data %x = extractelement <32 x i32> %vec, i32 3 %xNew = add i32 42, %x %vecNew = insertelement <32 x i32>…

Invoking native code with hand-written assembly

Submitted by 本小妞迷上赌 on 2019-12-07 13:42:04
Question: I'm trying to call a native function from a managed assembly. I've done this with pre-compiled libraries and everything went well. At the moment I'm building my own library, and I can't get this to work. The native DLL source is the following: #define DERM_SIMD_EXPORT __declspec(dllexport) #define DERM_SIMD_API __cdecl extern "C" { DERM_SIMD_EXPORT void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result, float *left, float *right); } void DERM_SIMD_API Matrix4x4_Multiply_SSE(float *result,…

Websocket data unmasking / multi byte xor

Submitted by 泪湿孤枕 on 2019-12-07 13:38:26
Question: The websocket spec defines unmasking data as j = i MOD 4; transformed-octet-i = original-octet-i XOR masking-key-octet-j, where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over the bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I tried to look up the solution myself, but was unable to figure…

How to determine SSE prefetch instruction size?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-07 13:30:38
Question: I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo. Answer 1:…

“Extend” data type size in SSE register

Submitted by 喜你入骨 on 2019-12-07 13:01:04
Question: I'm using VS2005 (at work) and need an SSE intrinsic that does the following: I have a pre-existing __m128i n filled with 16-bit integers a_1, a_2, ..., a_8. Since some calculations that I now want to do require 32 instead of 16 bits, I want to extract the two four-element sets of 16-bit integers from n and put them into two separate __m128i values which contain a_1, ..., a_4 and a_5, ..., a_8 respectively. I could do this manually using the various _mm_set intrinsics, but those would result in eight movs in…