SIMD

WebSocket data unmasking / multi-byte XOR

僤鯓⒐⒋嵵緔 submitted on 2019-12-05 21:15:31
The WebSocket spec defines unmasking data as: j = i MOD 4; transformed-octet-i = original-octet-i XOR masking-key-octet-j, where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over the bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I tried to look up the solution myself, but was unable to figure out whether there was an appropriate instruction in any of the dozens of SSE1-5/AVX/(whatever extension -
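Since XOR is bytewise, the per-byte loop vectorizes trivially: replicate the 4-byte key across a SIMD register and XOR 16 octets at a time. A minimal SSE2 sketch (function name and tail handling are my own, not from the question):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Unmask `len` bytes in place with the 4-byte WebSocket masking key.
 * The key is replicated to 16 bytes so one PXOR handles 16 octets per
 * iteration; since 16 is a multiple of 4, the j = i MOD 4 phase is
 * preserved across iterations.  The remainder is done byte by byte. */
static void ws_unmask(uint8_t *buf, size_t len, const uint8_t key[4])
{
    uint8_t rep[16];
    for (int i = 0; i < 16; i++)
        rep[i] = key[i & 3];
    __m128i vkey = _mm_loadu_si128((const __m128i *)rep);

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, vkey));
    }
    for (; i < len; i++)          /* scalar tail */
        buf[i] ^= key[i & 3];
}
```

On Haswell specifically, the same pattern with AVX2 (`_mm256_xor_si256`) processes 32 bytes per iteration; the 32-byte width is also a multiple of 4, so the key phase still lines up.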

x86 CPU Dispatching for SSE/AVX in C++

天大地大妈咪最大 submitted on 2019-12-05 20:53:58
I have an algorithm which benefits from hand optimisation with SSE(2) intrinsics. Moreover, the algorithm will also be able to benefit from the 256-bit AVX registers in the future. My question is: what is the best way to register the available variants of my class at compile time? So if my classes are, say, Foo, FooSSE2 and FooAVX, I require a means of: determining at runtime which classes are compiled in; determining the capabilities of the current CPU (at the lowest level this will result in a cpuid call); and deciding at runtime what to use based on what is compiled and what is supported. While I
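The runtime-capability step can be sketched without raw cpuid on GCC/Clang, which wrap the check in `__builtin_cpu_supports`. The kernel names below are hypothetical stand-ins for Foo/FooSSE2/FooAVX, not the asker's code:

```c
/* Hypothetical kernels standing in for Foo, FooSSE2 and FooAVX. */
static const char *foo_scalar(void) { return "scalar"; }
static const char *foo_sse2(void)   { return "sse2"; }
static const char *foo_avx(void)    { return "avx"; }

typedef const char *(*foo_fn)(void);

/* Pick the best compiled-in variant once, at startup.  On GCC/Clang
 * for x86, __builtin_cpu_supports wraps the cpuid feature check the
 * question mentions; other toolchains fall back to the scalar path. */
static foo_fn select_foo(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx"))  return foo_avx;
    if (__builtin_cpu_supports("sse2")) return foo_sse2;
#endif
    return foo_scalar;
}
```

A common design is to call `select_foo` once and cache the function pointer (or construct the matching class), so the dispatch cost is paid only at startup rather than per call.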

Porting MMX/SSE instructions to AltiVec

不问归期 submitted on 2019-12-05 18:36:08
Let me preface this with: I have extremely limited experience with ASM, and even less with SIMD. But it happens that I have the following MMX/SSE optimised code that I would like to port across to AltiVec instructions for use on PPC/Cell processors. This is probably a big ask. Even though it's only a few lines of code, I've had no end of trouble trying to work out what's going on here. The original function: static inline int convolve(const short *a, const short *b, int n) { int out = 0; union { __m64 m64; int i32[2]; } tmp; tmp.i32[0] = 0; tmp.i32[1] = 0; while (n >= 4) { tmp.m64 = _mm_add
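Assuming the elided intrinsic is the usual `_mm_madd_pi16` + `_mm_add_pi32` multiply-accumulate pattern, the MMX loop computes a plain dot product. A scalar C reference makes the semantics explicit before porting:

```c
#include <stdint.h>

/* A plain-C reference for what the MMX loop appears to compute: the
 * dot product of two int16 arrays, accumulated in 32 bits.
 * _mm_madd_pi16 multiplies pairs of 16-bit lanes and adds adjacent
 * 32-bit products, which sums to exactly this. */
static int convolve_scalar(const short *a, const short *b, int n)
{
    int32_t out = 0;
    for (int i = 0; i < n; i++)
        out += (int32_t)a[i] * b[i];
    return out;
}
```

On AltiVec the natural counterpart is `vec_msum(a, b, acc)`, which likewise performs a 16-bit multiply with 32-bit accumulation, processing 8 lanes per instruction instead of MMX's 4.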

“Extend” data type size in SSE register

此生再无相见时 submitted on 2019-12-05 18:36:05
I'm using VS2005 (at work) and need an SSE intrinsic that does the following: I have a pre-existing __m128i n filled with 16-bit integers a_1,a_2,....,a_8 . Since some calculations that I now want to do require 32 instead of 16 bits, I want to extract the two sets of four 16-bit integers from n and put them into two separate __m128i variables which contain a_1,...,a_4 and a_5,...,a_8 respectively. I could do this manually using the various _mm_set intrinsics, but those would result in eight movs in assembly, and I'd hoped that there would be a faster way to do this. Assuming that I understand
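With only SSE2 available (the VS2005-era baseline), a standard trick is an arithmetic shift to materialize the sign bits, then interleave. A sketch (the helper name is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Split eight signed 16-bit lanes into two vectors of four signed
 * 32-bit lanes using only SSE2: the arithmetic right shift yields
 * 0x0000 or 0xFFFF per lane (the sign extension), and interleaving
 * each half of n with it produces correct 32-bit values. */
static void widen_epi16_to_epi32(__m128i n, __m128i *lo, __m128i *hi)
{
    __m128i sign = _mm_srai_epi16(n, 15);   /* per-lane sign mask    */
    *lo = _mm_unpacklo_epi16(n, sign);      /* a_1..a_4 as 32-bit    */
    *hi = _mm_unpackhi_epi16(n, sign);      /* a_5..a_8 as 32-bit    */
}
```

That is three instructions instead of eight movs. On SSE4.1 or later, `_mm_cvtepi16_epi32` sign-extends the low half in a single instruction, but it is not available to a VS2005/SSE2 target.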

Horizontal trailing maximum on AVX or SSE

落花浮王杯 submitted on 2019-12-05 18:16:27
I have an __m256i register consisting of 16-bit values, and I want each zero element to take the value of the nearest preceding non-zero element. To give an example: input: 1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2 output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2 Is there an efficient way of doing this on AVX or SSE? Maybe with log(16) = 4 iterations? Addendum: any solution on 128-bit vectors with 8 uint16_t values in them is appreciated as well. You can do this in log_2(SIMD_width) steps indeed. The idea is to shift the input vector x_vec by two bytes. Then we blend x_vec with the shifted vector such that x_vec is
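For the 128-bit case the asker also mentions, the doubling scheme can be sketched in SSE2 alone. Each step shifts the vector up by 1, 2, then 4 lanes and copies the shifted value into lanes that are still zero; since those lanes are all-zero, an AND+OR acts as the blend. (Function name and lane convention — element 0 in the lowest lane — are my assumptions.)

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Fill-forward for 8 x uint16 in an __m128i: every zero lane takes
 * the value of the nearest preceding non-zero lane, in log2(8) = 3
 * steps.  After step k, every zero within distance 2^k - 1 of a
 * non-zero lane has been filled, so three steps cover all 8 lanes. */
static __m128i fill_trailing_zeros(__m128i x)
{
    __m128i zero = _mm_setzero_si128();
    __m128i m;

    m = _mm_cmpeq_epi16(x, zero);   /* mask of lanes still zero */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 2)));
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 4)));
    m = _mm_cmpeq_epi16(x, zero);
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 8)));
    return x;
}
```

The byte-shift counts must be immediates, which is why the three steps are written out rather than looped. The 256-bit AVX2 version needs one extra step plus a cross-lane fix-up, since `_mm256_slli_si256` shifts each 128-bit half independently.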

Explaining the different types in Metal and SIMD

僤鯓⒐⒋嵵緔 submitted on 2019-12-05 18:06:04
When working with Metal, I find there's a bewildering number of types, and it's not always clear to me which type I should be using in which context. In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are supported within a Metal shader file. However, there's plenty of sample code available that seems to use additional types that are part of SIMD. On the macOS (Objective-C) side of things, the Metal types are not available but the SIMD ones are, and I'm not sure which ones I'm supposed to be using. For example: In the Metal Spec, there's float2 that is

Floating-point number vs fixed-point number: speed on Intel I5 CPU

岁酱吖の submitted on 2019-12-05 17:52:55
I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc. Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get? Currently I'm working on an Intel i5 CPU. I'm using Openblas to perform the matrix calculations. How should I re-implement Openblas functions such as cblas_dgemm to perform fixed-point calculations? I know that SSE (Streaming SIMD Extensions) operates on 4x32=8x16=128 bit data at one time, i.e., 4 32-bit
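To make "16-bit fixed point" concrete: a Q8.8 number stores the value scaled by 2^8 in an int16_t, and a multiply needs a 32-bit intermediate plus a rescale. This is a sketch of the arithmetic only (the type and helper names are mine, not an Openblas API):

```c
#include <stdint.h>

typedef int16_t q8_8;   /* 8 integer bits, 8 fraction bits */

/* Multiply two Q8.8 fixed-point numbers.  The product of two values
 * scaled by 2^8 is scaled by 2^16; the 32-bit intermediate keeps all
 * fraction bits, and shifting right by 8 rescales back to Q8.8. */
static q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    return (q8_8)(((int32_t)a * b) >> 8);
}

static q8_8 q8_8_from_int(int x) { return (q8_8)(x << 8); }
```

The potential gain is that 8 such lanes fit in a 128-bit register versus 4 floats, but the widen-multiply-narrow sequence, overflow handling, and the drastic loss of range and precision mean the speedup is far from an automatic 2x; for well-tuned float GEMM it may be a slowdown.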

determinant calculation with SIMD

▼魔方 西西 submitted on 2019-12-05 17:18:38
Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to SSE3 and NEON, both under Linux. The matrix elements are all floats. Here's my 5 cents. Determinant of a 2x2 matrix: that's an exercise for the reader; it should be simple to implement. Determinant of a 3x3 matrix: use the scalar triple product. This will require smart cross() and dot() implementations. The recipes for these are widely available.
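The scalar triple product route can be stated as det = a . (b x c), with a, b, c the rows. A plain scalar sketch shows the structure; mapping cross() and dot() onto 4-float SSE/NEON vectors is the part the answer alludes to:

```c
/* 3x3 determinant via the scalar triple product det = a . (b x c),
 * where a, b and c are the matrix rows.  cross() and dot() are kept
 * scalar here; SIMD versions operate on one row per 4-float register. */
typedef struct { float x, y, z; } vec3;

static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static float det3x3(vec3 a, vec3 b, vec3 c)
{
    return dot(a, cross(b, c));
}
```

For the 4x4 case, a common SIMD-friendly approach is expansion by 2x2 sub-determinants (the "cofactor of 2x2 blocks" scheme) rather than full Laplace expansion along a row.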

Determine the minimum across SIMD lanes of __m256 value

做~自己de王妃 submitted on 2019-12-05 17:03:53
I understand that operations across SIMD lanes should generally be avoided. However, sometimes it has to be done. I am using AVX2 intrinsics, and have 8 floating point values in an __m256. I want to know the lowest value in this vector, and to complicate matters: also in which slot this was. My current solution makes a round trip to memory, which I don't like: float closestvals[8]; _mm256_store_ps( closestvals, closest8 ); float closest = closestvals[0]; int closestidx = 0; for ( int k=1; k<8; ++k ) { if ( closestvals[k] < closest ) { closest = closestvals[ k ]; closestidx = k; } } What would
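The round trip can be avoided by reducing in-register: two shuffle+min steps plus one cross-lane permute broadcast the minimum to every lane, then a compare and movemask recover the lane index. A sketch assuming GCC/Clang (the target attribute lets this compile without a global -mavx flag; the function name is mine):

```c
#include <immintrin.h>

/* Horizontal minimum of 8 floats plus its lane index, in-register.
 * Step 1 folds the two 128-bit halves together; steps 2 and 3 reduce
 * within each half, leaving the global minimum in all 8 lanes.  The
 * compare/movemask then flags every lane equal to the minimum, and
 * ctz picks the first, matching the scalar loop's tie-breaking. */
__attribute__((target("avx")))
static int min8_index(const float *p, float *min_out)
{
    __m256 v = _mm256_loadu_ps(p);
    __m256 m = _mm256_min_ps(v, _mm256_permute2f128_ps(v, v, 1));
    m = _mm256_min_ps(m, _mm256_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
    m = _mm256_min_ps(m, _mm256_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    int mask = _mm256_movemask_ps(_mm256_cmp_ps(v, m, _CMP_EQ_OQ));
    *min_out = _mm256_cvtss_f32(m);
    return __builtin_ctz(mask);
}
```

Whether this beats the store/reload version depends on context: the scalar loop's store forwards from cache and overlaps with other work, so it is worth benchmarking both.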

Does .NET Framework 4.5 provide SSE4/AVX support?

纵然是瞬间 submitted on 2019-12-05 12:41:10
I think I heard about that, but don't know where. Update: I was asking about the JIT; it seems that it is coming. (I just found out an hour ago.) Here are a few links: The JIT finally proposed. JIT and SIMD are getting married. Update to SIMD Support. You need the latest version of RyuJIT and the Microsoft SIMD-enabled Vector Types package (NuGet). No, there's no scenario in .NET where you can write machine code yourself. Code generation is entirely up to the just-in-time compiler. It is certainly capable of customizing its code generation based on the capabilities of the machine's processor. One of the big reasons why ngen