simd

Faster approximate reciprocal square root of an array

☆樱花仙子☆ submitted on 2019-12-08 06:56:40
Question: How can I calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2? The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats. The output is an array of floats. Both arrays have correct memory alignment for SSE. The code below uses only 1 xmm register, runs on Linux, and can be compiled with gcc -O3 code.cpp -lrt -msse4.2. Thank you.

#include <iostream>
#include <emmintrin.h>
#include <time.h>
using namespace std;

void print
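
The excerpt cuts off before the kernel, but the usual SSE answer to this question is the rsqrtps instruction. A minimal sketch, assuming (as the question states) aligned arrays and a length that is a multiple of 4; the function name is illustrative:

#include <xmmintrin.h>
#include <cstddef>

// Approximate 1/sqrt(x) for 4 floats per iteration with rsqrtps
// (about 12 bits of precision, far faster than sqrtps + divps).
void rsqrt_array(float* dst, const float* src, size_t count) {
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(src + i);          // aligned 16-byte load
        _mm_store_ps(dst + i, _mm_rsqrt_ps(v));   // aligned store of the estimate
    }
}

If more precision is needed, one Newton-Raphson step on the estimate roughly doubles the accurate bits.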

Optimizing SIMD histogram calculation

笑着哭i submitted on 2019-12-08 05:34:14
Question: I worked on code that implements a histogram calculation given an OpenCV struct IplImage * and an unsigned int * buffer for the histogram. I'm still new to SIMD, so I might not be taking advantage of the full potential the instruction set provides.

histogramASM:
  xor rdx, rdx
  xor rax, rax
  mov eax, dword [imgPtr + imgWidthOffset]
  mov edx, dword [imgPtr + imgHeightOffset]
  mul rdx
  mov rdx, rax                             ; rdx = Image Size
  mov r10, qword [imgPtr + imgDataOffset]  ; r10 = ImgData
NextPacket:
  mov rax, rdx
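
For reference, the scalar loop the asm above replaces is just a gather of increments; a minimal C sketch, assuming 8-bit grayscale pixels and a 256-bin histogram (names are illustrative, not from the question):

#include <stddef.h>

// Baseline histogram: one increment per pixel. Histograms are hard to
// vectorize directly because the stores scatter to data-dependent bins.
void histogram_scalar(const unsigned char* pixels, size_t n, unsigned int hist[256]) {
    for (size_t i = 0; i < n; ++i)
        hist[pixels[i]]++;
}

A common speedup that needs no SIMD at all is keeping several sub-histograms and summing them at the end, which breaks the store-to-load dependency on a single hot bin.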

Use load/store correctly

烂漫一生 submitted on 2019-12-08 04:21:52
Question: How do I use load/store to do aligned int16_t byte swapping correctly?

void byte_swapping(uint16_t* dest, const uint16_t* src, size_t count) {
    __m128i _s, _d;
    for (uint16_t const * end(dest + count); dest != end; dest += 8, src += 8) {
        _s = _mm_load_si128((__m128i*)src);
        _d = _mm_or_si128(_mm_slli_epi16(_s, 8), _mm_srli_epi16(_s, 8));
        _mm_store_si128((__m128i*) dest, _d);
    }
}

Answer 1: Your code will fail when count is not a multiple of 8, or when either src or dest is not 16-byte aligned. Here is a
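
The answer is cut off above; a hedged sketch of the kind of fix it is leading to, using unaligned loads/stores and a scalar tail so that neither misalignment nor count % 8 != 0 can break it:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void byte_swapping_fixed(uint16_t* dest, const uint16_t* src, size_t count) {
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {                 // full 16-byte vectors
        __m128i s = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i d = _mm_or_si128(_mm_slli_epi16(s, 8), _mm_srli_epi16(s, 8));
        _mm_storeu_si128((__m128i*)(dest + i), d);
    }
    for (; i < count; ++i)                           // 0-7 leftover elements
        dest[i] = (uint16_t)((src[i] << 8) | (src[i] >> 8));
}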

SIMD intrinsics - are they usable on GPUs?

随声附和 submitted on 2019-12-08 04:17:58
Question: I'm wondering if I can use SIMD intrinsics in GPU code, like a CUDA kernel or an OpenCL one. Is that possible?

Answer 1: No, SIMD intrinsics are just thin wrappers around asm code. They are CPU-specific. More about them here. Generally speaking, why would you do that? CUDA and OpenCL already contain many "functions" which are actually "GPU intrinsics" (all of these, for example, are single-precision math intrinsics for the GPU).

Answer 2: You use the vector data types built into the OpenCL C language. For
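
Answer 2 is cut off, but its point is easy to illustrate: OpenCL C has built-in vector types such as float4, so explicit intrinsics are unnecessary. A minimal kernel sketch (the kernel name and the scaling operation are made up for illustration):

// OpenCL C, not host code: float4 arithmetic maps onto the device's SIMD lanes.
__kernel void scale4(__global float4* data, const float k) {
    size_t i = get_global_id(0);   // one work-item per float4
    data[i] = data[i] * k;         // multiplies 4 floats at once
}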

SSE optimized code performs similar to plain version

*爱你&永不变心* submitted on 2019-12-07 22:26:53
Question: I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows in C++ I am targeting Linux in C (therefore I don't use _aligned_malloc but posix_memalign). I also implemented one compute-intensive method without making use of the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, usually being the time

Array Error - Access violation reading location 0xffffffff

和自甴很熟 submitted on 2019-12-07 18:04:49
Question: I have previously used SIMD operators to improve the efficiency of my code; however, I am now facing a new error which I cannot resolve. For this task, speed is paramount. The size of the array will not be known until the data is imported, and it may be very small (100 values) or enormous (10 million values). For the latter case the code works fine, but I encounter an error when I use fewer than 130036 array values. Does anyone know what is causing this issue and how to resolve it? I
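
The excerpt ends before the code, so the actual cause is not visible here, but an access violation that appears only below a size threshold is very often a vector loop reading past the end of a short buffer. A hedged sketch of the usual guard (purely illustrative; assumes float data and SSE):

#include <emmintrin.h>
#include <stddef.h>

// Only full groups of 4 go through the SSE path; the remainder is
// handled scalar-wise so no 16-byte load can cross the end of a
// small allocation.
void double_all(float* a, size_t n) {
    size_t vec_end = n - (n % 4);
    for (size_t i = 0; i < vec_end; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);
        _mm_storeu_ps(a + i, _mm_add_ps(v, v));
    }
    for (size_t i = vec_end; i < n; ++i)   // last 0-3 elements
        a[i] += a[i];
}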

SSE: How to reduce a _m128i._i32[4] to _m128i._i8

孤人 submitted on 2019-12-07 17:45:33
Question: I'm very new to SSE coding, and I want to store the result of four _m128i vectors with int32 type into one _m128i with int8 type. (The values of _m128i[j]._i32[k] are all between -127 and +127.) I think in pseudo-code this is the following:

result._i8 = { vec1._i8[0], vec1._i8[4], vec1._i8[8],  vec1._i8[12],
               vec2._i8[0], vec2._i8[4], vec2._i8[8],  vec2._i8[12],
               vec3._i8[0], vec3._i8[4], vec3._i8[8],  vec3._i8[12],
               vec4._i8[0], vec4._i8[4], vec4._i8[8],  vec4._i8[12] };

The only way I found is this messy
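
The standard answer here (not shown in the truncated excerpt) is two rounds of saturating packs; since every value is already in [-127, +127], the saturation never alters anything. A minimal sketch:

#include <emmintrin.h>

// Narrow 4 x 4 int32 into 16 int8 in two pack steps (SSE2).
__m128i pack_i32x16_to_i8(__m128i v1, __m128i v2, __m128i v3, __m128i v4) {
    __m128i lo = _mm_packs_epi32(v1, v2);   // 8 x int16
    __m128i hi = _mm_packs_epi32(v3, v4);   // 8 x int16
    return _mm_packs_epi16(lo, hi);         // 16 x int8, signed-saturated
}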

x86 CPU Dispatching for SSE/AVX in C++

青春壹個敷衍的年華 submitted on 2019-12-07 17:29:29
Question: I have an algorithm which benefits from hand-optimisation with SSE(2) intrinsics. Moreover, the algorithm will also be able to benefit from the 256-bit AVX registers in the future. My question is what is the best way to:

1. Register the available variants of my class at compile time, so if my classes are, say, Foo, FooSSE2 and FooAVX, I have a means of determining at runtime which classes are compiled in.
2. Determine the capabilities of the current CPU. At the lowest level this will result in
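
One common shape for the runtime half of this, sketched with the class names from the question (the factory and the GCC/Clang builtin are assumptions, not the asker's code):

struct Foo           { virtual ~Foo() {} virtual const char* name() const { return "scalar"; } };
struct FooSSE2 : Foo { const char* name() const override { return "SSE2"; } };
struct FooAVX  : Foo { const char* name() const override { return "AVX"; } };

// Pick the widest variant the running CPU supports.
// __builtin_cpu_supports is available in GCC >= 4.8 and Clang;
// other compilers would query cpuid directly.
Foo* make_foo() {
    if (__builtin_cpu_supports("avx"))  return new FooAVX();
    if (__builtin_cpu_supports("sse2")) return new FooSSE2();
    return new Foo();
}

The compile-time half is usually handled by building FooSSE2/FooAVX in separate translation units compiled with -msse2/-mavx, so the dispatcher itself stays compilable for the baseline target.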

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

[亡魂溺海] submitted on 2019-12-07 16:37:52
Question: I would like to know if the following is possible with any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential group of 7 bits, starting from the LSB, is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result is only the significant bytes, thus 0-9 in size, which is converted to a byte array.

In: 0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ
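
Since the question targets modern x86, it is worth noting that BMI2 (Haswell and later) does the core 7-to-8-bit spread in a single instruction per qword, without SIMD shuffles. A hedged sketch of just that step (compile with -mbmi2; the continuation-bit padding and the ninth output byte of a full 63-bit input still need separate handling):

#include <immintrin.h>
#include <stdint.h>

// Deposit bits 0-55 of x into the low 7 bits of each of 8 output bytes.
uint64_t spread7(uint64_t x) {
    return _pdep_u64(x, 0x7f7f7f7f7f7f7f7fULL);
}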

Websocket data unmasking / multi-byte XOR

泪湿孤枕 submitted on 2019-12-07 13:38:26
Question: The websocket spec defines unmasking data as

j = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j

where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I tried to look up the solution myself, but was unable to figure
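
The question is cut off, but the standard vector answer is to replicate the 4-byte mask across a 16-byte register and XOR 16 octets per iteration; a minimal C sketch (function name assumed):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

void unmask(uint8_t* data, size_t len, const uint8_t mask[4]) {
    uint32_t m;
    memcpy(&m, mask, 4);                        // mask as one 32-bit word
    __m128i vmask = _mm_set1_epi32((int)m);     // mask repeated 4x = 16 bytes
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {            // 16 octets per XOR
        __m128i v = _mm_loadu_si128((const __m128i*)(data + i));
        _mm_storeu_si128((__m128i*)(data + i), _mm_xor_si128(v, vmask));
    }
    for (; i < len; ++i)                        // 0-15 leftover octets
        data[i] ^= mask[i & 3];                 // j = i MOD 4
}

Because 16 is a multiple of 4, the replicated mask stays in phase with the per-octet rule at every vector step.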