
Is there an advantage of specifying “-mfpu=neon-vfpv3” over “-mfpu=neon” for ARMs with separate pipelines?

Question: My Zynq-7000 ARM Cortex-A9 processor has both the NEON and the VFPv3 extensions, and the Zynq-7000-TRM says that the processor is configured to have "Independent pipelines for VFPv3 and advanced SIMD instructions". So far I have compiled my programs with Linaro GCC 6.3-2017.05 and the -mfpu=neon option to make use of SIMD instructions. But when the compiler also has non-SIMD operations to issue, will it make a difference to use -mfpu=neon-vfpv3? Will GCC's instruction selection


Using ARM NEON intrinsics to add alpha and permute

Question: I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components? void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) { numPix /= 8; //process 8 pixels at a time uint8x8_t alpha = vdup_n_u8 (0xff); for (int i=0; i<numPix; i++) { uint8x8x3_t rgb = vld3_u8 (src); uint8x8x4_t bgra; bgra.val[0] = rgb.val[2]; //these lines are slow bgra.val[1]
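For reference, a complete version of the loop the excerpt starts to describe might look like the sketch below. It is only an illustration under assumptions: the function name `rgb_to_bgra`, the pointer arithmetic, and the scalar tail loop are not from the original post, and the NEON path is guarded so the same file also builds on targets without NEON. It makes no claim about being faster than the poster's version.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: packed RGB (3 bytes/pixel) to BGRA (4 bytes/pixel),
 * with alpha forced to 0xff. */
void rgb_to_bgra(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x8_t alpha = vdup_n_u8(0xff);
    /* Process 8 pixels per iteration. */
    for (; i + 8 <= num_pixels; i += 8) {
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);  /* de-interleave R, G, B */
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];                /* B */
        bgra.val[1] = rgb.val[1];                /* G */
        bgra.val[2] = rgb.val[0];                /* R */
        bgra.val[3] = alpha;                     /* A */
        vst4_u8(dst + 4 * i, bgra);              /* re-interleave as B, G, R, A */
    }
#endif
    /* Scalar path: handles the remaining pixels, or all of them without NEON. */
    for (; i < num_pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Calling `rgb_to_bgra(src, dst, n)` requires `src` to hold `3 * n` bytes and `dst` to hold `4 * n` bytes.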

RGBA to ABGR: Inline arm neon asm for iOS/Xcode

Question: This code (very similar code; I haven't tried exactly this code) compiles using the Android NDK, but not with Xcode/armv7+arm64/iOS. Errors are in the comments: uint32_t *src; uint32_t *dst; #ifdef __ARM_NEON __asm__ volatile( "vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected "vrev32.8 q0, q0 \n" // error: Unrecognized instruction mnemonic "vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected : : [src]"r"(src), [dst]"r"(dst) : "d0", "d1" ); #endif What's wrong with this code?
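The inline assembly above uses 32-bit ARM (AArch32) NEON syntax, which an arm64 assembler does not accept. One way to sidestep the syntax difference is to express the same load, byte reversal, and store with intrinsics, which compile for both armv7 and arm64. The sketch below assumes that approach. The function name `rgba_to_abgr_4px` and the scalar fallback are illustrative and not from the original post.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reverse the byte order inside each of four
 * 32-bit pixels (RGBA -> ABGR). */
void rgba_to_abgr_4px(const uint32_t *src, uint32_t *dst)
{
#if defined(__ARM_NEON)
    /* Same sequence as the asm: vld1.32, vrev32.8, vst1.32. */
    uint8x16_t bytes = vreinterpretq_u8_u32(vld1q_u32(src));
    vst1q_u32(dst, vreinterpretq_u32_u8(vrev32q_u8(bytes)));
#else
    /* Portable fallback so the sketch also builds without NEON. */
    for (int i = 0; i < 4; ++i) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u) |
                 ((x << 8) & 0x00ff0000u) | (x << 24);
    }
#endif
}
```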

How to OR all lanes of a NEON vector

Question: I want to use NEON intrinsics to optimize the following code. uint32x4_t c1; // 4 elements, each element is 0 or 1 uint32x4_t c2; // 4 elements, each element is 0 or 1 uint8_t pack = 0; // unsigned char, for result /* some code */ // need optimizing pack |= vgetq_lane_u32(c1, 0); pack |= (vgetq_lane_u32(c1, 1) << 1); pack |= (vgetq_lane_u32(c1, 2) << 2); pack |= (vgetq_lane_u32(c1, 3) << 3); pack |= (vgetq_lane_u32(c2, 0) << 4); pack |= (vgetq_lane_u32(c2, 1) << 5); pack |= (vgetq_lane_u32(c2, 2)
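One possible approach, sketched here under assumptions, is to shift every lane by its bit position with a single `vshlq_u32` per vector and then OR the lanes together. The helper name `pack_flags`, the array-based interface (chosen so the sketch also builds without NEON), and the scalar fallback are illustrative. The excerpt is truncated, so this assumes the pattern continues up to bit 7.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: pack eight 0/1 flags into one byte.
 * c1[k] becomes bit k, c2[k] becomes bit k + 4. */
uint8_t pack_flags(const uint32_t c1[4], const uint32_t c2[4])
{
#if defined(__ARM_NEON)
    static const int32_t shifts_lo[4] = {0, 1, 2, 3};
    static const int32_t shifts_hi[4] = {4, 5, 6, 7};
    /* Per-lane left shift: lane k moves to its target bit position. */
    uint32x4_t v1 = vshlq_u32(vld1q_u32(c1), vld1q_s32(shifts_lo));
    uint32x4_t v2 = vshlq_u32(vld1q_u32(c2), vld1q_s32(shifts_hi));
    /* OR the two vectors, then fold 4 lanes -> 2 lanes -> scalar. */
    uint32x4_t v = vorrq_u32(v1, v2);
    uint32x2_t h = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return (uint8_t)(vget_lane_u32(h, 0) | vget_lane_u32(h, 1));
#else
    /* Scalar fallback with the same result. */
    uint8_t pack = 0;
    for (int k = 0; k < 4; ++k) {
        pack |= (uint8_t)(c1[k] << k);
        pack |= (uint8_t)(c2[k] << (k + 4));
    }
    return pack;
#endif
}
```

With vectors already in registers, the `vld1q_u32` loads would be replaced by the existing `c1` and `c2` values.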

Is there a way to detect VFP/NEON/Thumb/… on iOS at runtime?

Question: So it's fairly easy to figure out what kind of CPU an iOS device runs by querying sysctlbyname("hw.cpusubtype", ...), but there seems to be no obvious way to figure out what features the CPU actually has (think VFP, NEON, Thumb, ...). Can someone think of a way to do this? Basically, what I need is something similar to getauxval(AT_HWCAP) on Linux/Android, which returns a bit mask of features supported by the CPU. A few things to note: The information must be retrieved at runtime from the OS
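As a rough illustration of the kind of query being asked about, the sketch below wraps `sysctlbyname` in a small helper that reads an integer-valued key. The helper name `cpu_has_feature` is hypothetical, and the assumption that a given `hw.optional.*` key (for example `hw.optional.neon`) is present on a particular iOS version is not confirmed by the excerpt. On non-Apple platforms the helper simply reports "unknown".

```c
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: returns 1 if the key reports a non-zero value,
 * 0 if it reports zero, and -1 if the key cannot be queried. */
int cpu_has_feature(const char *key)
{
#if defined(__APPLE__)
    int value = 0;
    size_t len = sizeof value;
    if (sysctlbyname(key, &value, &len, NULL, 0) != 0)
        return -1;              /* key missing or not readable */
    return value != 0;
#else
    (void)key;                  /* no sysctlbyname here: unknown */
    return -1;
#endif
}
```

A caller would treat -1 as "cannot tell" and fall back to a conservative code path.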

Can ARM and NEON work in parallel?

Question: This refers to the question "Checksum code implementation for Neon in Intrinsics". I am opening the sub-questions listed there as separate questions, since multiple questions should not be asked in a single thread. Anyway, coming to the question: Can ARM and NEON (speaking in terms of the ARM Cortex-A8 architecture) actually work in parallel? How can I achieve this? Could someone point me to, or share, some sample implementations (pseudo-code/algorithms/code, not the theoretical

warning: format '%ld' expects argument of type 'long int', but argument has type '__builtin_neon_di'

Question: With regard to my earlier question, I am not able to cross-check the output. I am getting some wrong printed output after execution. Can someone tell me whether the printf() statements are wrong or the logic I am using is wrong? CODE: int64_t arr[2] = {227802,9896688}; int64x2_t check64_2 = vld1q_s64(arr); for(int i = 0;i < 2; i++){ printf("check64_2[%d]: %ld\n",i,check64_2[i]); } int64_t way1 = check64_2[0] + check64_2[1]; int64x1_t way2 = vset_lane_s64(vgetq_lane_s64(check64_2, 0) + vgetq_lane_s64(check64
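On 32-bit ARM, `long` is 32 bits wide, so `%ld` does not match a 64-bit lane. A minimal sketch of one way to print the lanes portably is to store the vector into a plain `int64_t` array and use the `PRId64` macro from `<inttypes.h>`. The function name `sum_and_print` and the non-NEON fallback are illustrative. The excerpt is truncated, so this only addresses the format-specifier warning, not the rest of the poster's logic.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: prints both 64-bit lanes and returns their sum. */
int64_t sum_and_print(const int64_t arr[2])
{
    int64_t lanes[2];
#if defined(__ARM_NEON)
    int64x2_t v = vld1q_s64(arr);
    vst1q_s64(lanes, v);        /* copy the lanes out as ordinary int64_t */
#else
    lanes[0] = arr[0];          /* fallback so the sketch builds anywhere */
    lanes[1] = arr[1];
#endif
    for (int i = 0; i < 2; ++i) {
        /* PRId64 expands to the right conversion for int64_t. */
        printf("lane[%d]: %" PRId64 "\n", i, lanes[i]);
    }
    return lanes[0] + lanes[1];
}
```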