neon

Is there an advantage of specifying “-mfpu=neon-vfpv3” over “-mfpu=neon” for ARMs with separate pipelines?

六眼飞鱼酱① submitted on 2020-03-21 19:24:12
Question: My Zynq-7000 ARM Cortex-A9 processor has both the NEON and the VFPv3 extension, and the Zynq-7000 TRM says that the processor is configured to have "independent pipelines for VFPv3 and Advanced SIMD instructions". So far I have compiled my programs with Linaro GCC 6.3-2017.05 and the -mfpu=neon option to make use of SIMD instructions. But in the case that the compiler also has non-SIMD floating-point operations to issue, will it make a difference to use -mfpu=neon-vfpv3? Will GCC's instruction selection
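For context, a minimal hypothetical test kernel (not from the original question) that mixes scalar floating-point work with NEON SIMD work; compiling it once with -mfpu=neon and once with -mfpu=neon-vfpv3 and diffing the generated assembly is one way to see whether GCC's instruction selection actually changes:

#include <arm_neon.h>

/* Hypothetical kernel mixing NEON SIMD and scalar VFP math, so the
 * assembly produced under -mfpu=neon vs. -mfpu=neon-vfpv3 can be compared. */
float mixed_kernel(const float *a, const float *b, int n, float bias) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i + 4 <= n; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i)); /* NEON multiply-accumulate */
    }
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);                 /* horizontal add of the four lanes */
    return vget_lane_f32(s, 0) * bias;   /* scalar multiply, candidate for the VFP pipeline */
}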

Compiling the Android version of OpenCV on Windows 10 without CMake and MinGW

て烟熏妆下的殇ゞ submitted on 2020-03-21 00:42:54
Setting up the build project: there is no need to install extra tools such as CMake or MinGW; only the Android NDK and the OpenCV source code are required. The following are the steps for compiling OpenCV from a cmd window on Windows 10, using OpenCV 4.0.1:

Create a new folder under the OpenCV root directory, for example build_cmd. Because OpenCV requires that CMAKE_BINARY_DIR and CMAKE_SOURCE_DIR not be the same folder, cmake cannot be run directly in the root directory. cd into build_cmd and run the following command, adjusting the paths to match your setup:

E:\Android\Sdk\cmake\3.10.2.4988404\bin\cmake.exe ^
    -DCMAKE_TOOLCHAIN_FILE=E:\Android\Sdk\ndk-bundle\build\cmake\android.toolchain.cmake ^
    -DANDROID_NDK=E:\Android\Sdk\ndk-bundle ^
    -DANDROID_ABI="arm64-v8a" ^
    -DANDROID_SDK=E:\Android\Sdk ^
    -DWITH_TBB=ON ^
    -DCPU_BASELINE=NEON ^
    -DCPU_DISPATCH=NEON ^
    -DOPENCV_ENABLE_NONFREE=ON ^
    -DBUILD_ANDROID_EXAMPLES=OFF

When the signal drops to E, how can speech recognition break free of the network?

点点圈 submitted on 2020-03-19 17:39:25
This article was published by Tencent Education Cloud in the Tencent Cloud+ Community column. Without a network, speech recognition usually looks like this ▽ while in the same environment, embedded speech recognition looks like this ▽ It recognizes as you speak, and even personalized names do not trip it up. That is the appeal of embedded speech recognition. Starting from the implementation and optimization of the 微信智聆 embedded speech recognition engine, this article introduces the technology choices behind embedded speech recognition. 01 Roughly how speech recognition came about: speech recognition lets machines "understand" human speech and turn spoken content into the corresponding text. It began in the 1950s, evolving from the earliest small-vocabulary isolated-word systems to today's large-vocabulary continuous recognition systems. The significant performance gains mainly come from several factors: the arrival of the big-data era, the application of deep neural networks to speech recognition, and the development of GPU hardware. As a result, speech recognition has gradually become practical and productized: voice input methods, intelligent voice assistants, in-car voice interaction systems... Speech recognition is a frontier in humanity's pursuit of artificial intelligence, and a cornerstone of today's machine translation, natural language understanding, and human-computer interaction. However, these performance gains rely on the high compute power and large memory of server-side CPUs/GPUs, and without a network the convenience of speech recognition is out of reach. To solve this problem, 微信智聆 developed embedded speech recognition, also known as embedded LVCSR (or offline LVCSR, Large Vocabulary Continuous Speech

Training a children's English acoustic model based on TPNN

本秂侑毒 submitted on 2020-03-19 16:46:42
Preface: TPNN is a deep learning platform developed in-house at 学而思网校, with an architecture optimized specifically for acoustic model training; it lets researchers connect speech features and the decoder seamlessly. Within this framework we have also implemented mainstream acoustic model architectures and efficient multi-GPU training, and on top of TPNN we carried out development of a children's acoustic model on large-scale data. Through extensive experiments covering model structure, feature dimensionality, modeling units and more, combined with an n-gram language model and over ten thousand hours of children's English data, we arrived at the acoustic model architecture best suited to English recognition for Chinese children. Our children's acoustic model reaches a recognition accuracy above 92%, with industry-leading performance. To meet business needs we also implemented an offline recognition solution for the children's acoustic model: using 8-bit quantization, NEON optimization, mixed-precision arithmetic and other techniques (see the sketch after this excerpt), we reach near-server computation speed on mobile devices with only a small loss in accuracy. This article introduces 学而思网校's children's acoustic model training technology from three angles: TPNN's multi-GPU training technology, acoustic model training, and model optimization on mobile. 1. TPNN's multi-GPU acceleration technology: deep-learning-based acoustic models have achieved great success in speech recognition, but they must be trained on massive amounts of data, which greatly increases training time and severely slows down research and development. An efficient multi-GPU training scheme is therefore a very important part of a deep learning framework. Building on NVidia's NCCL communication framework, TPNN uses the BMUF technique
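As a rough illustration of the 8-bit quantization plus NEON optimization mentioned above, here is a minimal sketch of an int8 dot-product kernel of the kind such an offline inference path typically relies on; the function name and structure are illustrative assumptions and are not taken from TPNN:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative sketch only: an 8-bit quantized dot product using NEON,
 * the kind of kernel an 8-bit-quantized acoustic model depends on. */
int32_t dot_s8_neon(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        int16x8_t prod = vmull_s8(vld1_s8(a + i), vld1_s8(b + i)); /* widening 8x8 -> 16-bit multiply */
        acc = vpadalq_s16(acc, prod);                              /* pairwise add-accumulate into 32-bit */
    }
    int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
    s = vpadd_s32(s, s);
    int32_t sum = vget_lane_s32(s, 0);
    for (; i < n; i++) sum += (int32_t)a[i] * b[i];                /* scalar tail */
    return sum;
}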

Using ARM NEON intrinsics to add alpha and permute

醉酒当歌 submitted on 2020-02-01 04:34:25
Question: I'm developing an iOS app that needs to convert images from RGB -> BGRA fairly quickly. I would like to use NEON intrinsics if possible. Is there a faster way than simply assigning the components?

void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix) {
    numPix /= 8; // process 8 pixels at a time
    uint8x8_t alpha = vdup_n_u8(0xff);
    for (int i = 0; i < numPix; i++) {
        uint8x8x3_t rgb = vld3_u8(src);
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2]; // these lines are slow
        bgra.val[1]
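For reference, here is a self-contained sketch of how such an RGB-to-BGRA loop is commonly written with vld3_u8/vst4_u8. It is an illustrative guess at the full routine, not the asker's original code, which is cut off above:

#include <arm_neon.h>

/* Sketch: convert 8 RGB pixels per iteration to BGRA by reusing the
 * deinterleaved planes from vld3_u8 and storing them interleaved with vst4_u8. */
void rgb_to_bgra_neon(const unsigned char *src, unsigned char *dst, int numPix) {
    uint8x8_t alpha = vdup_n_u8(0xff);
    for (int i = 0; i < numPix / 8; i++) {
        uint8x8x3_t rgb = vld3_u8(src);
        uint8x8x4_t bgra;
        bgra.val[0] = rgb.val[2];   /* B */
        bgra.val[1] = rgb.val[1];   /* G */
        bgra.val[2] = rgb.val[0];   /* R */
        bgra.val[3] = alpha;        /* A = 0xff */
        vst4_u8(dst, bgra);
        src += 8 * 3;
        dst += 8 * 4;
    }
}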

RGBA to ABGR: Inline arm neon asm for iOS/Xcode

廉价感情. submitted on 2020-01-28 02:21:13
Question: This code (very similar code, I haven't tried exactly this code) compiles with the Android NDK, but not with Xcode/armv7+arm64/iOS. The errors are in the comments:

uint32_t *src;
uint32_t *dst;
#ifdef __ARM_NEON
__asm__ volatile(
    "vld1.32 {d0, d1}, [%[src]] \n"  // error: Vector register expected
    "vrev32.8 q0, q0 \n"             // error: Unrecognized instruction mnemonic
    "vst1.32 {d0, d1}, [%[dst]] \n"  // error: Vector register expected
    :
    : [src]"r"(src), [dst]"r"(dst)
    : "d0", "d1"
);
#endif

What's wrong with this code?
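One commonly suggested direction (a sketch, assuming the goal is the per-pixel byte reversal the assembly attempts) is to use NEON intrinsics instead of 32-bit-only assembly mnemonics, so the same source builds for both armv7 and arm64:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: RGBA -> ABGR by reversing the bytes inside each 32-bit pixel,
 * portable across armv7 and arm64 because intrinsics are used instead
 * of 32-bit-only assembly. */
static inline void rgba_to_abgr_4px(const uint32_t *src, uint32_t *dst) {
    uint8x16_t v = vld1q_u8((const uint8_t *)src); /* load 4 pixels (16 bytes) */
    v = vrev32q_u8(v);                             /* reverse bytes within each 32-bit lane */
    vst1q_u8((uint8_t *)dst, v);
}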

How to OR all lanes of a NEON vector

自古美人都是妖i submitted on 2020-01-25 07:49:06
Question: I want to use NEON intrinsics to optimize the following code.

uint32x4_t c1;    // 4 elements, each element is 0 or 1
uint32x4_t c2;    // 4 elements, each element is 0 or 1
uint8_t pack = 0; // unsigned char, for the result
/* some code */
// needs optimizing
pack |= vgetq_lane_u32(c1, 0);
pack |= vgetq_lane_u32(c1, 1) << 1;
pack |= vgetq_lane_u32(c1, 2) << 2;
pack |= vgetq_lane_u32(c1, 3) << 3;
pack |= vgetq_lane_u32(c2, 0) << 4;
pack |= vgetq_lane_u32(c2, 1) << 5;
pack |= vgetq_lane_u32(c2, 2)
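One possible NEON formulation (a sketch, not necessarily the accepted answer) shifts each lane into its target bit position and then ORs the lanes together with vector operations instead of eight vgetq_lane calls:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: pack eight 0/1 lanes (two uint32x4_t vectors) into one byte. */
static inline uint8_t pack_bits_neon(uint32x4_t c1, uint32x4_t c2) {
    static const int32_t shifts_lo[4] = {0, 1, 2, 3};
    static const int32_t shifts_hi[4] = {4, 5, 6, 7};
    int32x4_t sh_lo = vld1q_s32(shifts_lo);
    int32x4_t sh_hi = vld1q_s32(shifts_hi);
    /* Shift each lane to its bit position, then OR the two vectors. */
    uint32x4_t v = vorrq_u32(vshlq_u32(c1, sh_lo), vshlq_u32(c2, sh_hi));
    uint32x2_t o = vorr_u32(vget_low_u32(v), vget_high_u32(v)); /* fold 4 lanes to 2 */
    o = vorr_u32(o, vrev64_u32(o));                             /* fold 2 lanes to 1 */
    return (uint8_t)vget_lane_u32(o, 0);
}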

Is there a way to detect VFP/NEON/Thumb/… on iOS at runtime?

主宰稳场 submitted on 2020-01-14 10:29:08
Question: So it's fairly easy to figure out what kind of CPU an iOS device runs by querying sysctlbyname("hw.cpusubtype", ...), but there seems to be no obvious way to figure out what features the CPU actually has (think VFP, NEON, Thumb, ...). Can someone think of a way to do this? Basically, what I need is something similar to getauxval(AT_HWCAP) on Linux/Android, which returns a bit mask of features supported by the CPU. A few things to note: the information must be retrieved at runtime from the OS
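A sketch of the sysctl-based direction hinted at in the question; the specific hw.optional.* key names (for example "hw.optional.neon") are assumptions here, and their availability varies by OS version:

#include <sys/sysctl.h>
#include <stdint.h>

/* Sketch only: query an hw.optional.* sysctl key and treat a missing key
 * as "feature not reported". Key names are assumed, not guaranteed. */
static int has_hw_feature(const char *key) {
    int32_t value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname(key, &value, &size, NULL, 0) != 0)
        return 0;                /* key not present: feature unknown/absent */
    return value != 0;
}

/* Usage: int neon = has_hw_feature("hw.optional.neon"); */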

Can ARM and NEON work in parallel?

谁都会走 submitted on 2020-01-09 09:16:06
Question: This is with reference to the question: Checksum code implementation for Neon in Intrinsics. I am opening the sub-questions listed in that link as separate individual questions, since multiple questions shouldn't be asked in a single thread. Anyway, coming to the question: can ARM and NEON (speaking in terms of the ARM Cortex-A8 architecture) actually work in parallel? How can I achieve this? Could someone point me to, or share, some sample implementations (pseudo-code/algorithms/code, not the theoretical
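As a purely illustrative sketch (an assumption about what source-level overlap could look like, not a statement about how the Cortex-A8 actually schedules instructions), independent scalar and NEON work can be placed in the same loop so the in-order core has both kinds of instructions available to issue:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: sum one buffer with NEON while summing another with the ARM
 * integer pipeline in the same loop. The two computations are independent,
 * so their execution can overlap. Tail elements are ignored for brevity. */
void sum_both(const uint32_t *simd_buf, const uint32_t *scalar_buf, int n,
              uint32_t *simd_sum, uint32_t *scalar_sum) {
    uint32x4_t vacc = vdupq_n_u32(0);
    uint32_t sacc = 0;
    for (int i = 0; i + 4 <= n; i += 4) {
        vacc = vaddq_u32(vacc, vld1q_u32(simd_buf + i));  /* NEON work */
        sacc += scalar_buf[i] + scalar_buf[i + 1]          /* ARM integer work */
              + scalar_buf[i + 2] + scalar_buf[i + 3];
    }
    uint32x2_t h = vadd_u32(vget_low_u32(vacc), vget_high_u32(vacc));
    h = vpadd_u32(h, h);
    *simd_sum = vget_lane_u32(h, 0);
    *scalar_sum = sacc;
}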

warning: format '%ld' expects argument of type 'long int', but argument has type '__builtin_neon_di'

家住魔仙堡 submitted on 2020-01-06 13:14:16
Question: With reference to my earlier question, I am not able to cross-check the output. I am getting a wrong print statement after execution. Can someone tell me whether the printf() statements are wrong or whether the logic I am using is wrong?

CODE:

int64_t arr[2] = {227802, 9896688};
int64x2_t check64_2 = vld1q_s64(arr);
for (int i = 0; i < 2; i++) {
    printf("check64_2[%d]: %ld\n", i, check64_2[i]);
}
int64_t way1 = check64_2[0] + check64_2[1];
int64x1_t way2 = vset_lane_s64(vgetq_lane_s64(check64_2, 0) + vgetq_lane_s64(check64
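A common fix for the format warning itself (a sketch, not necessarily the accepted answer) is to copy each lane into a plain int64_t and print it with the PRId64 macro instead of passing the vector element directly to printf:

#include <arm_neon.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    int64_t arr[2] = {227802, 9896688};
    int64x2_t v = vld1q_s64(arr);
    for (int i = 0; i < 2; i++) {
        /* Copy the lane into a plain int64_t so the printf format is well defined. */
        int64_t lane = (i == 0) ? vgetq_lane_s64(v, 0) : vgetq_lane_s64(v, 1);
        printf("v[%d]: %" PRId64 "\n", i, lane);
    }
    return 0;
}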