simd

Comparison with NaN using AVX

北城以北 posted on 2020-01-13 01:38:47
Question: I am trying to create a fast decoder for BPSK using Intel's AVX intrinsics. I have a set of complex numbers represented as interleaved floats, but because of the BPSK modulation only the real parts (the even-indexed floats) are needed. Every float x is mapped to 0 when x < 0 and to 1 when x >= 0. This is accomplished using the following routine: static inline void normalize_bpsk_constellation_points(int32_t *out, const complex_t *in, size_t num) { static const __m256 _min_mask =
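
The mapping described in the question can be pinned down with a scalar reference before any AVX code is written. The sketch below is not the poster's routine (which is cut off above); the function names and the plain `float` array layout are assumptions made for illustration. It shows two ways of computing the bit and where they differ: the comparison sends NaN to 0 and -0.0f to 1, while the sign-bit test only looks at bit 31, so -0.0f maps to 0 and a NaN maps according to its sign bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference: iq holds interleaved (re, im) floats,
 * so the real part of sample i lives at iq[2 * i].
 * Comparison-based mapping: NaN >= 0 is false, so NaN decodes to 0. */
static void bpsk_decode_compare(int32_t *out, const float *iq, size_t num)
{
    assert(out != NULL && iq != NULL);
    for (size_t i = 0; i < num; i++)
        out[i] = (iq[2 * i] >= 0.0f) ? 1 : 0;
}

/* Sign-bit mapping: inspects only bit 31 of the IEEE-754 encoding.
 * -0.0f decodes to 0 here, and a NaN decodes by its sign bit. */
static void bpsk_decode_signbit(int32_t *out, const float *iq, size_t num)
{
    assert(out != NULL && iq != NULL);
    for (size_t i = 0; i < num; i++) {
        uint32_t bits;
        memcpy(&bits, &iq[2 * i], sizeof bits); /* safe type pun */
        out[i] = (int32_t)((~bits) >> 31);      /* 1 if sign bit is clear */
    }
}
```

Running a vectorized decoder and one of these references over the same buffer, including NaN and signed-zero inputs, is one way to see which of the two behaviours the SIMD version actually has.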

Constexpr and SSE intrinsics

半世苍凉 posted on 2020-01-12 07:20:31
Question: Most C++ compilers support SIMD (SSE/AVX) instructions through intrinsics like _mm_cmpeq_epi32. My problem is that this function is not marked constexpr, although "semantically" there is no reason for it not to be constexpr, since it is a pure function. Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr? Obviously I would like the function to use the proper asm at runtime; I know I can reimplement any SIMD function with slow

How can I apply __attribute__(( aligned(32))) to an int *?

一曲冷凌霜 posted on 2020-01-10 03:07:08
Question: In my program I need to apply __attribute__((aligned(32))) to an int * or float *. I tried it like this, but I'm not sure it will work: int *rarray __attribute__((aligned(32))); I saw this but didn't find the answer. Answer 1: So you want to tell the compiler that your pointers are aligned, e.g. that all callers of this function will pass pointers that are guaranteed to be aligned? These would be either pointers to aligned static or local storage, or pointers they got from C11 aligned_alloc or POSIX posix_memalign.
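
Note that the declaration in the question aligns the pointer variable itself, not the memory it points to. The answer is cut off above, so the following is only a sketch of one approach that GCC and Clang offer: `__builtin_assume_aligned` as a promise about the pointee, combined with C11 `aligned_alloc` on the caller's side. The helper names `sum_aligned32` and `alloc_ints_aligned32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: the caller promises that data is 32-byte aligned.
 * The assert checks the promise in debug builds; the builtin (GCC/Clang)
 * passes the same promise on to the optimizer. */
static long sum_aligned32(const int *data, size_t n)
{
    assert(((uintptr_t)data % 32) == 0);
    const int *p = __builtin_assume_aligned(data, 32);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Hypothetical helper: allocates room for n ints on a 32-byte boundary.
 * The size is rounded up because aligned_alloc expects a multiple of
 * the alignment. Release the result with free(). */
static int *alloc_ints_aligned32(size_t n)
{
    size_t bytes = n * sizeof(int);
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return aligned_alloc(32, rounded);
}
```

If a caller ever passes a pointer that is not actually aligned, the promise is broken and the behaviour is undefined, which is why the debug-time assert is kept next to the builtin.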

Find the first instance of a character using simd

£可爱£侵袭症+ posted on 2020-01-10 02:59:06
Question: I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get a matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions? Answer 1: You have the right idea
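
The compare, movemask, bit-scan plan from the question can be sketched as follows. This is an illustration under stated assumptions, not the answer's code (which is cut off above): it uses the 16-byte SSE2 forms `_mm_cmpeq_epi8` and `_mm_movemask_epi8` rather than the AVX2 ones so that it builds without extra flags on x86-64, it uses the GCC/Clang builtin `__builtin_ctz` in place of `ffs`, and it falls back to a plain loop when SSE2 is not available. The name `find_first_byte` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first occurrence of ch in buf[0..len),
 * or len if ch does not occur. */
static size_t find_first_byte(const char *buf, size_t len, char ch)
{
    assert(buf != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8(ch);
    for (; i + 16 <= len; i += 16) {
        /* Unaligned load of 16 bytes, then a byte-wise equality compare:
         * matching bytes become 0xFF, all others 0x00. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);
        /* Collapse the 16 byte results into a 16-bit mask. */
        int mask = _mm_movemask_epi8(eq);
        if (mask != 0)
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
#endif
    /* Scalar tail (and full fallback when SSE2 is not available). */
    for (; i < len; i++)
        if (buf[i] == ch)
            return i;
    return len;
}
```

Widening this to 32 bytes per step would swap in the `_mm256_*` equivalents named in the question and requires AVX2 support at both compile time and run time.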

How to do runtime binding based on CPU capabilities on linux

[亡魂溺海] posted on 2020-01-09 19:16:00
Question: Is it possible to have a Linux library (e.g. "libloader.so") load another library to resolve any external symbols? I've got a whole bunch of code that gets conditionally compiled for the SIMD level to be supported (SSE2, AVX, AVX2). This works fine if the build platform is the same as the runtime platform, but it hinders reuse across different processor generations. One thought is to have an executable that calls function link to a libloader.so that does not directly implement function . Rather
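
A lighter-weight alternative to loading a separate library is to detect the CPU level once and pick a function pointer inside a single binary. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_init` and `__builtin_cpu_supports` are available; on other targets it reports no SIMD level. The `sum_*` functions are placeholders that all run the same scalar loop; in a real build each would live in its own translation unit compiled with the matching `-m` flag. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum simd_level { SIMD_NONE, SIMD_SSE2, SIMD_AVX, SIMD_AVX2 };

typedef long (*sum_fn)(const int *, size_t);

/* Shared scalar loop used by every placeholder below. */
static long sum_scalar(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Placeholders standing in for per-ISA builds of the same routine. */
static long sum_generic(const int *v, size_t n) { return sum_scalar(v, n); }
static long sum_sse2(const int *v, size_t n)    { return sum_scalar(v, n); }
static long sum_avx(const int *v, size_t n)     { return sum_scalar(v, n); }
static long sum_avx2(const int *v, size_t n)    { return sum_scalar(v, n); }

/* Queries the running CPU (GCC/Clang on x86 only). */
static enum simd_level detect_simd_level(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return SIMD_AVX2;
    if (__builtin_cpu_supports("avx"))  return SIMD_AVX;
    if (__builtin_cpu_supports("sse2")) return SIMD_SSE2;
#endif
    return SIMD_NONE;
}

/* Maps a detected level to the implementation to call. */
static sum_fn select_sum(enum simd_level level)
{
    assert(level >= SIMD_NONE && level <= SIMD_AVX2);
    switch (level) {
    case SIMD_AVX2: return sum_avx2;
    case SIMD_AVX:  return sum_avx;
    case SIMD_SSE2: return sum_sse2;
    default:        return sum_generic;
    }
}
```

A caller would typically run `select_sum(detect_simd_level())` once at startup and keep the returned pointer. The same detection step could instead choose which shared object to `dlopen`, which is closer to the libloader.so idea in the question.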

ARM and NEON can work in parallel?

谁都会走 posted on 2020-01-09 09:16:06
Question: This is with reference to the question "Checksum code implementation for Neon in Intrinsics". I am opening the sub-questions listed in that link as separate individual questions, since multiple questions shouldn't be asked in a single thread. Coming to the question: can ARM and NEON (speaking in terms of the ARM Cortex-A8 architecture) actually work in parallel? How can I achieve this? Could someone point me to or share some sample implementations (pseudo-code/algorithms/code, not the theoretical

multiplication using SSE (x*x*x)+(y*y*y)

廉价感情. posted on 2020-01-06 14:18:10
Question: I'm trying to optimize this function using SIMD but I don't know where to start.

    long sum(int x,int y) { return x*x*x+y*y*y; }

The disassembled function looks like this:

    4007a0: 48 89 f2        mov %rsi,%rdx
    4007a3: 48 89 f8        mov %rdi,%rax
    4007a6: 48 0f af d6     imul %rsi,%rdx
    4007aa: 48 0f af c7     imul %rdi,%rax
    4007ae: 48 0f af d6     imul %rsi,%rdx
    4007b2: 48 0f af c7     imul %rdi,%rax
    4007b6: 48 8d 04 02     lea (%rdx,%rax,1),%rax
    4007ba: c3              retq
    4007bb: 0f 1f 44 00 00  nopl 0x0(%rax,%rax,1)

The calling code
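
A single call like this offers little for SIMD to work on; vector instructions pay off when many (x, y) pairs are processed together. Since the calling code is cut off above, the following is only a sketch under that assumption: a hypothetical batched variant, `sum_cubes_batch`, written as a plain loop that a vectorizing compiler may or may not turn into SIMD depending on the target. It widens to 64 bits before multiplying and assumes the inputs are small enough (roughly below one million in magnitude) for the result to fit in an int64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batched form: out[i] = x[i]^3 + y[i]^3 for i in [0, n).
 * restrict tells the compiler the arrays do not overlap, which removes
 * one obstacle to auto-vectorization. */
static void sum_cubes_batch(int64_t *restrict out,
                            const int32_t *restrict x,
                            const int32_t *restrict y,
                            size_t n)
{
    assert(out != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        int64_t a = x[i]; /* widen first so the cube is computed in 64 bits */
        int64_t b = y[i];
        out[i] = a * a * a + b * b * b;
    }
}
```

Whether this loop is actually vectorized can be checked from the compiler's optimization report or from the generated assembly.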

warning: format '%ld' expects argument of type 'long int', but argument has type '__builtin_neon_di'

家住魔仙堡 posted on 2020-01-06 13:14:16
Question: With regard to my earlier question, I am not able to cross-check the output. I am getting some wrong output from the print statements after execution. Can someone tell me whether the printf() statements are wrong or the logic I am using is wrong?

CODE:

    int64_t arr[2] = {227802,9896688};
    int64x2_t check64_2 = vld1q_s64(arr);
    for(int i = 0;i < 2; i++){
        printf("check64_2[%d]: %ld\n",i,check64_2[i]);
    }
    int64_t way1 = check64_2[0] + check64_2[1];
    int64x1_t way2 = vset_lane_s64(vgetq_lane_s64(check64_2, 0) + vgetq_lane_s64(check64
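
The warning in the title points at the format string: on 32-bit ARM, long is 32 bits wide, so `%ld` does not match a 64-bit lane. A minimal sketch of one way to print the lanes portably is shown below, assuming the lanes are first extracted into plain int64_t values with `vgetq_lane_s64` and then formatted with the `PRId64` macro from `<inttypes.h>`. The function name `format_lanes` is hypothetical, and the non-NEON branch exists only so that the sketch also builds on other targets.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Writes both 64-bit lanes as "lane0 lane1" into dst and returns the
 * value of snprintf (the number of characters the full text needs). */
static int format_lanes(char *dst, size_t cap, const int64_t *arr)
{
    assert(dst != NULL && arr != NULL);
#if defined(__ARM_NEON)
    int64x2_t v = vld1q_s64(arr);
    /* Lane indices must be compile-time constants. */
    int64_t lane0 = vgetq_lane_s64(v, 0);
    int64_t lane1 = vgetq_lane_s64(v, 1);
#else
    int64_t lane0 = arr[0];
    int64_t lane1 = arr[1];
#endif
    /* PRId64 expands to the right conversion for int64_t on this target. */
    return snprintf(dst, cap, "%" PRId64 " %" PRId64, lane0, lane1);
}
```

With the array from the question, the buffer would hold the two input values separated by a space, which makes it easy to compare against the values that were loaded.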
