Using ARM NEON intrinsics to add alpha and permute
I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) { numPix /= 8; //process 8 pixels at a time uint8x8_t alpha = vdup_n_u8 (0xff); for (int i=0; i<numPix; i++) { uint8x8x3_t rgb = vld3_u8 (src); uint8x8x4_t bgra; bgra.val[0] = rgb.val[2]; //these lines are slow bgra.val[1] = rgb.val[1]; //these lines are slow bgra.val[2] = rgb.val[0]; //these lines are slow bgra.val[3] =