simd

Generate vector code from Haskell?

╄→гoц情女王★ submitted 2019-12-21 20:28:25
Question: Is it possible to get GHC to produce SIMD code for the various SSE generations? E.g., given a program like this:

```haskell
import Data.Array.Vector

main = print . sumU $ (enumFromToFracU 1 10000000 :: UArr Double)
```

I can see that the generated code (compiled for 64-bit x86) uses SSE instructions in scalar mode (with both the C and asm backends), so addsd rather than addpd. For the types of programs I work on, the use of vector instructions is important for performance. Is there an easy way for a newbie such as myself to get…

Slower SSE performance on large array sizes

删除回忆录丶 submitted 2019-12-21 20:09:36
Question: I am new to SSE programming, so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below.

```c
int ssum(const int *d, unsigned int len)
{
    static const unsigned int BLOCKSIZE = 4;
    unsigned int i, remainder;
    int output;
    __m128i xmm0, accumulator;
    __m128i *src;

    remainder = len % BLOCKSIZE;
    src = (__m128i*)d;
    accumulator = _mm_loadu_si128(src);
    output = 0;
    for(i…
```

Accumulated computation error in an SSE version of a sum-of-squared-differences algorithm

左心房为你撑大大i submitted 2019-12-21 17:48:37
Question: I was trying to optimize the following code (a sum of squared differences for two arrays):

```cpp
inline float Square(float value)
{
    return value*value;
}

float SquaredDifferenceSum(const float * a, const float * b, size_t size)
{
    float sum = 0;
    for(size_t i = 0; i < size; ++i)
        sum += Square(a[i] - b[i]);
    return sum;
}
```

So I performed an optimization using the CPU's SSE instructions:

```cpp
inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
{
    __m128 _a = _mm_loadu_ps(a +…
```

ARM NEON SIMD version 2

こ雲淡風輕ζ submitted 2019-12-21 17:03:19
Question: What is the difference between NEON SIMD and NEON SIMD version 2, as in the Cortex-A15?

Answer 1: It adds a SIMD FMA instruction (VFMA.F32) and also mandates the NEON half-precision extension. NEONv2 is supported in the ARM Cortex-A7, ARM Cortex-A15, and Qualcomm Krait (not sure about the ARM Cortex-A5).

Answer 2: It is not that much of a difference. From the ARM ARM (in reverse order of definitions): Advanced SIMDv2 is an OPTIONAL extension to the ARMv7-A and ARMv7-R profiles. Advanced SIMDv2 adds both the Half-precision…

NEON float multiplication is slower than expected

只谈情不闲聊 submitted 2019-12-21 16:56:35
Question: I have two tables of floats. I need to multiply elements from the first table by the corresponding elements from the second table and store the result in a third table. I would like to use NEON to parallelize the float multiplications: four multiplications simultaneously instead of one. I expected a significant acceleration, but I achieved only about a 20% reduction in execution time. This is my code:

```cpp
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>

const int n = 100; // table size

/* fill a…
```

Is vec_sld endian sensitive?

為{幸葍}努か submitted 2019-12-21 16:47:13
Question: I'm working on a PowerPC machine with in-core crypto. I'm having trouble porting AES key expansion from big endian to little endian using built-ins. Big endian works, but little endian does not. The algorithm below is the snippet presented in an IBM blog article. I think I have the issue isolated to line 2 below:

```c
typedef __vector unsigned char uint8x16_p8;

uint8x16_p8 r0 = {0};

r3 = vec_perm(r1, r1, r5);    /* line 1 */
r6 = vec_sld(r0, r1, 12);     /* line 2 */
r3 = vcipherlast(r3, r4);     /* line 3 */
```

Fastest 64-bit population count (Hamming weight)

 ̄綄美尐妖づ submitted 2019-12-21 12:28:44
Question: I had to calculate the Hamming weight for a quite fast continuous flow of 64-bit data, and using the popcnt assembly instruction throws an exception on my Intel Core i7-4650U. I checked my bible, Hacker's Delight, and scanned the web for all kinds of algorithms (there are a bunch out there, since people started tackling this 'problem' at the birth of computing). I spent the weekend playing around with some ideas of my own and came up with these algorithms, where I'm almost at the speed at which I can move data…

How can I set __m128i without using any SSE instructions?

廉价感情. submitted 2019-12-21 11:45:08
Question: I have many functions which use the same constant __m128i values. For example:

```c
const __m128i K8  = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);
```

So I want to store all these constants in one place. But there is a problem: I check for CPU extensions at run time. If the CPU doesn't support, for example, SSE (or AVX), then the program will crash…

How many cycles does it take to put data into a SIMD register?

不问归期 submitted 2019-12-21 05:37:32
Question: I'm a student learning the x86 and ARM architectures, and I was wondering how many cycles it takes to put multiple pieces of data into SIMD registers. I understand that an x86 SSE xmm register is 128 bits wide. If I want to put 32 8-bit values into an xmm register from the stack via the SIMD instruction set and assembly language, does it take the same amount of time as a general-purpose register PUSH/POP, or does it need 32x the time, one for each 8 bits of data? Thank…

Are older SIMD-versions available when using newer ones?

谁说胖子不能爱 submitted 2019-12-21 03:59:51
Question: When I can use SSE3 or AVX, are older SSE versions such as SSE2 or MMX available, or do I still need to check for them separately?

Answer 1: In general, these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports…