neon

How to optimize histogram statistics with neon intrinsics?

╄→尐↘猪︶ㄣ submitted on 2019-11-29 15:38:06
I want to optimize histogram statistics code with NEON intrinsics, but I didn't succeed. Here is the C code:

    #define NUM (7*1024*1024)
    uint8 src_data[NUM];
    uint32 histogram_result[256] = {0};
    for (int i = 0; i < NUM; i++) {
        histogram_result[src_data[i]]++;
    }

Histogram statistics are more like serial processing, so they are difficult to optimize with NEON intrinsics. Does anyone know how to optimize this? Thanks in advance.

You can't vectorise the stores directly, but you can pipeline them, and you can vectorise the address calculation on 32-bit platforms (and to a lesser extent on 64-bit platforms). The first
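The answer above is cut off, so what follows is not its method. It is a minimal portable C sketch of one common way to keep histogram updates flowing: spread the increments over four private tables so that back-to-back updates seldom touch the same counter, then merge the tables at the end. The function name `histogram_u8` and the choice of four tables are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): byte histogram that
 * spreads increments over four private tables and merges them at the end.
 * result[] is overwritten, not accumulated into. */
void histogram_u8(const uint8_t *src, size_t n, uint32_t result[256])
{
    uint32_t part[4][256];
    memset(part, 0, sizeof part);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* One 32-bit load fetches four pixels; memcpy sidesteps
         * alignment and aliasing problems. Byte order does not matter
         * here because every byte is counted exactly once. */
        uint32_t w;
        memcpy(&w, src + i, sizeof w);
        part[0][w & 0xFF]++;
        part[1][(w >> 8) & 0xFF]++;
        part[2][(w >> 16) & 0xFF]++;
        part[3][w >> 24]++;
    }

    /* Leftover bytes when n is not a multiple of four. */
    for (; i < n; i++)
        part[0][src[i]]++;

    /* Merge the four private tables into the caller's histogram. */
    for (int b = 0; b < 256; b++)
        result[b] = part[0][b] + part[1][b] + part[2][b] + part[3][b];
}
```

With the question's data this would be called as `histogram_u8(src_data, NUM, histogram_result)`. Whether it beats the plain loop depends on the core and the compiler, so it is worth measuring on the target.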

Load 8bit uint8_t as uint32_t?

淺唱寂寞╮ submitted on 2019-11-29 14:55:37
Question: My image processing project works with grayscale images. I have an ARM Cortex-A8 processor platform and want to make use of NEON. I have a grayscale image (consider the example below), and in my algorithm I have to add only the columns. How can I load four 8-bit pixel values in parallel, which are uint8_t, as four uint32_t into one of the 128-bit NEON registers? What intrinsic do I have to use to do this? I mean: I must load them as 32 bits because if you look carefully, the moment I do 255
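One way to get four uint8_t pixels into four uint32_t lanes is to widen in two steps with `vmovl_u8` and `vmovl_u16`. The sketch below is an illustration under assumptions, not the accepted answer: the helper name `widen4_u8_to_u32` is made up, a little-endian target is assumed, and a scalar fallback is included so the same file also builds where NEON is unavailable.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen four consecutive 8-bit pixels into four
 * 32-bit values, written to out[0..3]. */
void widen4_u8_to_u32(const uint8_t *src, uint32_t out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32_t w;
    memcpy(&w, src, sizeof w);                          /* read exactly 4 bytes */
    uint8x8_t  b = vreinterpret_u8_u32(vdup_n_u32(w));  /* pixels in lanes 0..3 (little-endian assumed) */
    uint16x8_t h = vmovl_u8(b);                         /* 8 x u8 -> 8 x u16 */
    uint32x4_t v = vmovl_u16(vget_low_u16(h));          /* low 4 x u16 -> 4 x u32 */
    vst1q_u32(out, v);
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 4; i++)
        out[i] = src[i];
#endif
}
```

In a real column-sum loop the `uint32x4_t` would stay in a register and be accumulated with `vaddq_u32` instead of being stored after every load.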

How to stop GCC from breaking my NEON intrinsics?

﹥>﹥吖頭↗ submitted on 2019-11-29 10:44:55
I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON intrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls. Does anyone know how to have GCC get out of the way and just translate my intrinsics into code? Here's an example: I have a simple loop which negates and copies floating point values. It works with 4
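The asker's example is cut off above. For orientation only, a negate-and-copy loop written with intrinsics might look like the sketch below. It is not the asker's code and it does not address the scheduling problem; the name `negate_copy_f32` and the scalar fallback are assumptions.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = -src[i] for i in [0, n). */
void negate_copy_f32(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four floats per iteration: load, negate, store. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(src + i);
        vst1q_f32(dst + i, vnegq_f32(v));
    }
#endif
    /* Remaining elements, or everything when NEON is unavailable. */
    for (; i < n; i++)
        dst[i] = -src[i];
}
```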

ARM to C calling convention, NEON registers to save

吃可爱长大的小学妹 submitted on 2019-11-29 01:13:16
Question: There is a similar post that covers regular registers. What about NEON registers? As far as I remember, either the top half or the bottom half of the registers has to be preserved across function calls. I can't find that info anywhere; can somebody clarify that? Thanks.

From the AAPCS, §5.1.1 Core registers:

    r0-r3 are the argument and scratch registers; r0-r1 are also the result registers
    r4-r8 are callee-save registers
    r9 might be a callee-save register or not (on some variants of AAPCS it is a special

Fast sine/cosine for ARMv7+NEON: looking for testers…

邮差的信 submitted on 2019-11-28 18:08:36
Question: Could somebody with access to an iPhone 3GS or a Pandora please test the following assembly routine I just wrote? It is supposed to compute sines and cosines really, really fast on the NEON vector FPU. I know it compiles fine, but without adequate hardware I can't test it. If you could just compute a few sines and cosines and compare the results with those of sinf() and cosf(), it would really help. Thanks! #include <math.h> /// Computes the sine and cosine of two angles /// in: angles = Two

Is there a good reference for ARM Neon intrinsics?

99封情书 submitted on 2019-11-28 17:53:55
The ARM reference manual doesn't go into much detail about the individual instructions (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html). Is there something that's a little more detailed? For more information on the instructions themselves, you need the Assembler Guide. The list you found there just shows the mapping from compiler intrinsics to assembly instructions. There's also the ARM C Language Extensions, which provides details on the usage of the intrinsics (see chapter 12) that could be useful. There is now an HTML version of the NEON Intrinsics

Coding for ARM NEON: How to start?

浪尽此生 submitted on 2019-11-28 16:52:55
Question: BACKGROUND (skip this if you like) Let me start by saying that I am no expert programmer. I am a young junior computer vision (CV) engineer, and I am fairly experienced in C++ programming, mainly because of extensive use of the great OpenCV2 C++ API. All I've learned was through the need to execute projects, solve problems and meet deadlines, as is the reality in the industry. Recently, we started developing CV software for embedded systems (ARM boards), and we do it using

RGBA to ABGR: Inline arm neon asm for iOS/Xcode

巧了我就是萌 submitted on 2019-11-28 12:49:46
This code (very similar code; I haven't tried exactly this code) compiles using the Android NDK, but not with Xcode/armv7+arm64/iOS. Errors in comments:

    uint32_t *src;
    uint32_t *dst;
    #ifdef __ARM_NEON
    __asm__ volatile(
        "vld1.32 {d0, d1}, [%[src]] \n"   // error: Vector register expected
        "vrev32.8 q0, q0 \n"              // error: Unrecognized instruction mnemonic
        "vst1.32 {d0, d1}, [%[dst]] \n"   // error: Vector register expected
        :
        : [src]"r"(src), [dst]"r"(dst)
        : "d0", "d1"
    );
    #endif

What's wrong with this code? EDIT1: I rewrote the code using intrinsics: uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src)); uint8x16
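The asker's intrinsics rewrite is truncated above, so the following is only a sketch of the same idea, not their final code. The inline assembly uses 32-bit ARM syntax, whereas intrinsics build for both armv7 and arm64. `vrev32q_u8` is the intrinsic counterpart of `vrev32.8` and reverses the bytes inside each 32-bit pixel. The function name `rgba_to_abgr` and the scalar fallback are assumptions added for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reverse the byte order of every 32-bit pixel,
 * turning RGBA into ABGR. */
void rgba_to_abgr(uint32_t *dst, const uint32_t *src, size_t pixels)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four pixels per iteration. */
    for (; i + 4 <= pixels; i += 4) {
        uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src + i));
        x = vrev32q_u8(x);                      /* byte swap within each 32-bit lane */
        vst1q_u32(dst + i, vreinterpretq_u32_u8(x));
    }
#endif
    /* Remaining pixels, or all of them when NEON is unavailable. */
    for (; i < pixels; i++) {
        uint32_t p = src[i];
        dst[i] = (p >> 24) | ((p >> 8) & 0x0000FF00u)
               | ((p << 8) & 0x00FF0000u) | (p << 24);
    }
}
```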

An Overview of the Rust GUI Framework Ecosystem

假装没事ソ submitted on 2019-11-28 12:02:39
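This article gives a fairly comprehensive comparison of how the current mainstream Rust GUI frameworks perform. ++ means very good, -- means very poor, and o means average. Eight frameworks take part in the comparison:

    (1) Electron + Neon
    (2) Electron + FFI
    (3) Electron + NodeJS Cpp Addon
    (4) Rust Program + Qt static
    (5) Rust program + Qt dynamic
    (6) Cpp program + Rust lib static + Qt static
    (7) Cpp program + Rust lib static + Qt dynamic
    (8) Gtk

The detailed results are as follows:

    Criterion              (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)
    Ease of building       ++   ++   +    --   +    o    +    +
    Build performance      ++   ++   ++   --   ++   --   ++   o
    Package size           -    -    -    ++   +    ++   +    o
    Ease of deployment     ++   ++   ++   +    o    +    o    -
    Rust interoperability  +    o    -    +    +    +    +    ++
    Development speed      ++   +    -    +    --   +    --   -
    Memory usage           o    o    o    o    o    o    o    +
    CPU usage              -    -    -    ++   ++   ++   ++   ++
    Security               o    o    o    +    +    +    +    +
    Appearance             ++   ++   ++   +    +    +    +    o
    Responsive UI          ++   ++   ++   o    o    o    o    -
    Framework stability    +    +    +    --   --   --   --   -
    Platform support       +    +    +    ++   ++   ++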

How to solve bad instruction `vadd.i16 q0,q0,q0' when attempting to check gcc for neon instruction

点点圈 submitted on 2019-11-28 11:16:29
Question: Checking whether gcc supports the NEON instruction vadd.i16 q0,q0,q0 failed.

test.c:

    int main () { __asm__("vadd.i16 q0, q0, q0"); return 0; }

    arm-linux-androideabi-gcc test.c
    /tmp/ccfc8m0G.s: Assembler messages:
    /tmp/ccfc8m0G.s:24: Error: bad instruction `vadd.i16 q0,q0,q0'

I tried the flags -mcpu=cortex-a8 -mfpu=neon, but still had no success. The above code was used to test gcc support for NEON instructions. Actually, I am trying to build x264 with NEON support for the ARM platform. After running the configure script x264