intrinsics | 易学教程

Using ARM NEON intrinsics to add alpha and permute

阅读更多关于 Using ARM NEON intrinsics to add alpha and permute

I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) { numPix /= 8; //process 8 pixels at a time uint8x8_t alpha = vdup_n_u8 (0xff); for (int i=0; i<numPix; i++) { uint8x8x3_t rgb = vld3_u8 (src); uint8x8x4_t bgra; bgra.val[0] = rgb.val[2]; //these lines are slow bgra.val[1] = rgb.val[1]; //these lines are slow bgra.val[2] = rgb.val[0]; //these lines are slow bgra.val[3] =

Determine CPUID as listed in the Intel Intrinsics Guide

阅读更多关于 Determine CPUID as listed in the Intel Intrinsics Guide

In the Intel Intrinsics Guide there are 'Latency and Throughput Information' at the bottom of several Intrinsics, listing the performance for several CPUID(s). For example, the table in the Intrinsics Guide looks as follows for the Intrinsic _mm_hadd_pd : CPUID(s) Parameters Latency Throughput 0F_03 13 4 06_2A xmm1, xmm2 5 2 06_25/2C/1A/1E/1F/2E xmm1, xmm2 5 2 06_17/1D xmm1, xmm2 6 1 06_0F xmm1, xmm2 5 2 Now: How do I determine, what ID my CPU has? I'm using Kubuntu 12.04 and tried with sudo dmidecode -t 4 and also with the little program cpuid from the Ubuntu packages, but their output isn't

Best assembly or compilation for minimum of three values

阅读更多关于 Best assembly or compilation for minimum of three values

I'm looking at code generated by GCC-4.8 for x86_64 and wondering if there is a better (faster) way to compute the minimum of three values. Here's an excerpt from Python's collections module that computes the minimum of m , rightindex+1 , and leftindex : ssize_t m = n; if (m > rightindex + 1) m = rightindex + 1; if (m > leftindex) m = leftindex; GCC generates serially dependent code with CMOVs: leaq 1(%rbp), %rdx cmpq %rsi, %rdx cmovg %rsi, %rdx cmpq %rbx, %rdx cmovg %rbx, %rdx Is there faster code that can take advantage of processor out-of-order parallel execution by removing the data

SSE 4 popcount for 16 8-bit values?

阅读更多关于 SSE 4 popcount for 16 8-bit values?

I have the following code which compiles with GCC using the flag -msse4 but the problem is that the pop count only gets the last four 8-bits of the converted __m128i type. Basically what I want is to count all 16 numbers inside the __m128i type but I'm not sure what intrinsic function call to make after creating the variable popA . Somehow popA has to be converted into an integer that contains all the 128-bits of information? I suppose theres _mm_cvtsi128_si64 and using a few shuffle few operations but my OS is 32-bit. Is there only the shuffle method and using _mm_cvtsi128_si32 ? EDIT: If the

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

阅读更多关于 _mm_alignr_epi8 (PALIGNR) equivalent in AVX2

In SSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. I'm currently in the midst of porting my SSE4 code to use AVX2 instructions and working on 256bit registers instead of 128bit. Naively, I believed that the intrinsics function _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8 only on

Make compiler copy characters using movsd

阅读更多关于 Make compiler copy characters using movsd

I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time critical function. The best code for this on CPU side seems to be rep movsd . However I somehow cannot make my compiler to generate this code. I hoped (and I vaguely remember seeing so) using memcpy would do this using compiler built-in intrinsics, but based on disassembly and debugging it seems compiler is using call to memcpy/memmove library implementation instead. I also hoped the compiler might be smart enough to recognize following loop and use rep movsd on its own, but it seems it

what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256

阅读更多关于 what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256

问题 I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256 . The Intel Intrinsics guide only states that the lddqu version may perform better when crossing a cache line boundary. What might be the advantages of loadu ? In general how are these functions different? 回答1: There's no reason to ever use _mm256_lddqu_si256 , consider it a synonym for _mm256_loadu_si256 . lddqu only exists for historical reasons as x86 evolved towards having better

How can I get an intrinsic for the exp() function in x64 code?

阅读更多关于 How can I get an intrinsic for the exp() function in x64 code?

I have the following code and am expecting the intrinsic version of the exp() function to be used. Unfortunately, it is not in an x64 build, making it slower than a similar Win32 (i.e., 32-bit build): #include "stdafx.h" #include <cmath> #include <intrin.h> #include <iostream> int main() { const int NUM_ITERATIONS=10000000; double expNum=0.00001; double result=0.0; for (double i=0;i<NUM_ITERATIONS;++i) { result+=exp(expNum); // <-- The code of interest is here expNum+=0.00001; } // To prevent the above from getting optimized out... std::cout << result << '\n'; } I am using the following

Why and when to use __noop?

阅读更多关于 Why and when to use __noop?

I was reading about __noop and the MSDN example is #if DEBUG #define PRINT printf_s #else #define PRINT __noop #endif int main() { PRINT("\nhello\n"); } and I don't see the gain over just having an empty macro: #define PRINT The generated code is the same. What's a valid example of using __noop that actually makes it useful? The __noop intrinsic specifies that a function should be ignored and the argument list be parsed but no code be generated for the arguments. It is intended for use in global debug functions that take a variable number of arguments. In your case the argument is an obviously

How do initialize an SIMD vector with a range from 0 to N?

阅读更多关于 How do initialize an SIMD vector with a range from 0 to N?

I have the following function I'm trying to write an AXV version for: void hashids_shuffle(char *str, size_t str_length, char *salt, size_t salt_length) { size_t i, j, v, p; char temp; if (!salt_length) { return; } for (i = str_length - 1, v = 0, p = 0; i > 0; --i, ++v) { v %= salt_length; p += salt[v]; j = (salt[v] + v + p) % i; temp = str[i]; str[i] = str[j]; str[j] = temp; } } I'm trying to vectorize v %= salt_length; . I want to initialize a vector that contains numbers from 0 to str_length in order to use SVML's _mm_rem_epu64 in order to calculate v for each loop iteration. How do I