avx

Finding lists of prime numbers with SIMD - SSE/AVX

余生长醉 submitted on 2020-01-21 05:26:05
Question: I'm curious if anyone has advice on how to use SIMD to find lists of prime numbers. In particular, I'm interested in how to do this with SSE/AVX. The two algorithms I have been looking at are trial division and the Sieve of Eratosthenes. I have managed to find a way to use SSE with trial division. I found a faster way to do division which works well for a vector/scalar: "Division by Invariant Integers Using Multiplication" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 Each time I

Vectorizing merge/union of two sorted arrays

半城伤御伤魂 submitted on 2020-01-15 08:43:45
Question: I have recently started looking into opportunities to speed up my code by using vector instructions. My code relies heavily on operations with sets; for simplicity, let us assume that these are represented as sorted arrays of 16-bit unsigned integers. The operations I need to perform are: Intersection (i.e., each element contained in both sets is to be present in the output set); Union (i.e., each element that is contained in at least one of the sets is to be present in the output set exactly
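As a point of reference for the two operations described in this excerpt, here is a minimal scalar sketch of the classic two-pointer merge. It is not the asker's code and it is not vectorized; the function names `intersect_u16` and `union_u16` are hypothetical, and the sketch assumes both inputs are sorted ascending and free of duplicates. A SIMD version would normally be validated against a baseline like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar two-pointer intersection of two sorted, duplicate-free arrays.
   Assumption: out has room for at least min(na, nb) elements.
   Returns the number of elements written. */
size_t intersect_u16(const uint16_t *a, size_t na,
                     const uint16_t *b, size_t nb, uint16_t *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                 /* a[i] cannot be in b any more */
        } else if (b[j] < a[i]) {
            j++;                 /* b[j] cannot be in a any more */
        } else {
            out[n++] = a[i];     /* common element: emit once */
            i++;
            j++;
        }
    }
    return n;
}

/* Scalar two-pointer union of two sorted, duplicate-free arrays.
   Assumption: out has room for at least na + nb elements.
   Returns the number of elements written. */
size_t union_u16(const uint16_t *a, size_t na,
                 const uint16_t *b, size_t nb, uint16_t *out)
{
    assert(out != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[n++] = a[i++];
        } else if (b[j] < a[i]) {
            out[n++] = b[j++];
        } else {
            out[n++] = a[i];     /* present in both: emit exactly once */
            i++;
            j++;
        }
    }
    while (i < na) out[n++] = a[i++];   /* copy whatever is left of a */
    while (j < nb) out[n++] = b[j++];   /* copy whatever is left of b */
    return n;
}
```

Both loops are data-dependent branches on every element, which is the part a vectorized variant would try to replace.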

Select unique/deduplication in SSE/AVX

我与影子孤独终老i submitted on 2020-01-13 08:27:11
Question: Problem: Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions? Example: We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}. Restrictions: Stability: preservation of the input order is of no significance. Output: however, any removed values/NULLs must be at the beginning and/or end of the register: {null, 1, 2, 3} - OK; {1, 2, null, null} - OK; {null, 2, null, null} - OK

Comparison with NaN using AVX

北城以北 submitted on 2020-01-13 01:38:47
Question: I am trying to create a fast decoder for BPSK using Intel's AVX intrinsics. I have a set of complex numbers that are represented as interleaved floats, but due to the BPSK modulation only the real parts (the even-indexed floats) are needed. Every float x is mapped to 0 when x < 0 and to 1 when x >= 0. This is accomplished using the following routine: static inline void normalize_bpsk_constellation_points(int32_t *out, const complex_t *in, size_t num) { static const __m256 _min_mask =

AVX: data alignment: store crash, storeu, load, loadu doesn't

孤街醉人 submitted on 2020-01-11 09:19:07
Question: I am modifying RNNLM, a neural net for studying language models. However, given the size of my corpus, it runs really slowly. I tried to optimize the matrix*vector routine, which accounts for 63% of total time on a small data set (I would expect it to be worse on larger sets). Right now I am stuck with intrinsics. for (b=0; b<(to-from)/8; b++) { val = _mm256_setzero_ps(); for (a=from2; a<to2; a++) { t1 = _mm256_set1_ps (srcvec.ac[a]); t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)

Find the first instance of a character using simd

£可爱£侵袭症+ submitted on 2020-01-10 02:59:06
Question: I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding out whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get a matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions? Answer 1: You have the right idea
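The compare, movemask, find-first-set plan described in the question can be sketched as follows. This is an illustration under stated assumptions, not the answer's code: it uses the 128-bit SSE2 forms (`_mm_cmpeq_epi8`, `_mm_movemask_epi8`) so that it builds without AVX2, falls back to a plain byte loop when `__SSE2__` is not defined, and relies on the GCC/Clang builtin `__builtin_ctz` in place of `ffs`. The function name `find_first_byte` is hypothetical. An AVX2 variant would follow the same shape with `_mm256_loadu_si256`, `_mm256_cmpeq_epi8`, `_mm256_movemask_epi8` and 32-byte steps.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first byte equal to c in buf[0..len),
   or len if there is no such byte. */
size_t find_first_byte(const char *buf, size_t len, char c)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8(c);       /* c in all 16 lanes */
    for (; i + 16 <= len; i += 16) {
        /* Unaligned load, so buf needs no particular alignment. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle); /* 0xFF where equal */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq); /* 1 bit per byte */
        if (mask != 0) {
            /* Lowest set bit is the first matching byte in this chunk. */
            return i + (size_t)__builtin_ctz(mask);
        }
    }
#endif
    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i < len; i++) {
        if (buf[i] == c) {
            return i;
        }
    }
    return len;
}
```

Testing the mask against zero answers the "has any byte been set to 0xFF" part of the question, and the trailing-zero count turns the same mask into an index, so no second pass over the vector is needed.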

FLT_EPSILON for a nth root finder with SSE/AVX

亡梦爱人 submitted on 2020-01-07 06:50:26
Question: I'm trying to convert a function that finds the nth root in C for a double value, from the following link http://rosettacode.org/wiki/Nth_root#C , so that it finds the nth root for 8 floats at once using AVX. Part of that code uses DBL_EPSILON * 10. However, when I convert this to use float with AVX, I have to use FLT_EPSILON*1000 or the code hangs and does not converge. When I print out FLT_EPSILON I see it is of order 1E-7. But this link, http://www.cplusplus.com/reference/cfloat/ , says it should be 1E-5.

Can I compile OpenCL code into ordinary, OpenCL-free binaries?

痴心易碎 submitted on 2020-01-05 15:19:14
Question: I am evaluating OpenCL for my purposes. It occurred to me that you can't assume it works out of the box on either Windows or Mac because: Windows needs an OpenCL driver (which, of course, can be installed); MacOS supports OpenCL only on MacOS >= 10.6. So I'd have to write FPU/SSE/AVX code and OpenCL code separately to produce two binaries: one without and one with OpenCL support. It would be much better if I could compile OpenCL at compile time into SSE/AVX and then ship a binary without OpenCL in

Can I compile OpenCL code into ordinary, OpenCL-free binaries?

╄→尐↘猪︶ㄣ submitted on 2020-01-05 15:18:46
Question: I am evaluating OpenCL for my purposes. It occurred to me that you can't assume it works out of the box on either Windows or Mac because: Windows needs an OpenCL driver (which, of course, can be installed); MacOS supports OpenCL only on MacOS >= 10.6. So I'd have to write FPU/SSE/AVX code and OpenCL code separately to produce two binaries: one without and one with OpenCL support. It would be much better if I could compile OpenCL at compile time into SSE/AVX and then ship a binary without OpenCL in

Rotating (by 90°) a bit matrix (up to 8x8 bits) within a 64-bit integer

帅比萌擦擦* submitted on 2020-01-04 15:15:10
Question: I have a bit matrix (of size 6x6, or 7x7, or 8x8) stored within one single 64-bit integer. I am looking for C++ code that rotates these matrices by 90, 180, or 270 degrees, as well as C++ code for shifting (horizontally and vertically) and mirroring these matrices. The output must again be a 64-bit integer. Using some of the advanced CPU instruction sets would probably be okay, as would using hash tables or similar techniques; speed is of the highest importance, and RAM is available. I will run
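For the 8x8 case only, one possible approach is to build the rotation from a row flip followed by a transpose, both done with plain shifts and masks. The sketch below is an illustration, not code from the question or its answers, and it is written in C that also compiles as C++. It assumes a layout that the excerpt does not specify: the bit for row r, column c sits at bit index 8*r + c, with row 0 drawn at the top and column 0 at the left. The function names are hypothetical, and the 6x6 and 7x7 sizes are not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the order of the 8 rows (i.e. reverse the 8 bytes). */
uint64_t flip_rows8(uint64_t x)
{
    x = ((x >> 8) & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Transpose the 8x8 matrix: bit (r, c) moves to (c, r).
   Three rounds of masked swaps: 1x1 cells inside 2x2 blocks,
   2x2 blocks inside 4x4 blocks, then 4x4 blocks inside the 8x8. */
uint64_t transpose8(uint64_t x)
{
    uint64_t t;
    t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AAULL;
    x ^= t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL;
    x ^= t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL;
    x ^= t ^ (t << 28);
    return x;
}

/* Rotate 90 degrees clockwise: bit (r, c) moves to (c, 7 - r).
   Under the assumed layout this equals a row flip followed by a transpose. */
uint64_t rotate8_cw(uint64_t x)
{
    return transpose8(flip_rows8(x));
}

/* Rotate clockwise by turns * 90 degrees (so 2 is 180, 3 is 270). */
uint64_t rotate8_quarter_turns(uint64_t x, int turns)
{
    assert(turns >= 0);
    for (int k = 0; k < turns % 4; k++) {
        x = rotate8_cw(x);
    }
    return x;
}
```

Under this layout, `rotate8_cw(0x1)` moves the top-left bit to the top-right corner and returns `0x80`. With a different bit ordering the same two building blocks still apply, but the composition order (or the direction of rotation) changes.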