avx2

Fast transpose byte matrix [][]byte in Golang assembly

Submitted by 六眼飞鱼酱① on 2020-08-10 13:08:58
Question: Matrix transposition in pure Go is slow, and using the gonum package requires a structure conversion that costs extra time, so an assembly version may be a better solution. The matrix size varies ([][]byte) or can be fixed ([64][512]byte), and the element type may be int32 or int64 in more general scenarios. Below is a Go version:

    m := 64
    n := 512
    // original matrix
    M := make([][]byte, m)
    for i := 0; i < m; i++ {
        M[i] = make([]byte, n)
    }

    func transpose(M [][]byte) [][]byte {
        m := len(M)
        n := len(M[0])
        T := make([][]byte, n)
        for i := 0; i < n; i++ {
            T[i] = make([]byte, m)
            for j := 0; j < m; j++ {
                T[i][j] = M[j][i]
            }
        }
        return T
    }
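For context, the usual first step before reaching for hand-written SIMD or assembly is cache blocking, since the naive loop above touches dst (or src) with a large stride. Below is a minimal sketch of that loop structure in C++ rather than the asker's Go; the function name transpose_blocked and the tile size B = 32 are illustrative assumptions, not anything from the question:

    #include <cstddef>

    // Sketch: cache-blocked byte transpose. src is m x n, dst is n x m.
    // Processing B x B tiles keeps both the reads and the strided writes
    // within a small working set; a SIMD/assembly kernel would typically
    // replace the two inner loops.
    constexpr std::size_t B = 32;  // assumed tile size

    void transpose_blocked(const unsigned char* src, unsigned char* dst,
                           std::size_t m, std::size_t n) {
        for (std::size_t ib = 0; ib < m; ib += B)
            for (std::size_t jb = 0; jb < n; jb += B)
                for (std::size_t i = ib; i < ib + B && i < m; ++i)
                    for (std::size_t j = jb; j < jb + B && j < n; ++j)
                        dst[j * m + i] = src[i * n + j];
    }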

Do all CPUs which support AVX2 also support SSE4.2 and AVX?

Submitted by 自作多情 on 2020-07-29 12:06:11
Question: I am planning to implement runtime detection of SIMD extensions. If I find out that the processor has AVX2 support, is it guaranteed to also have SSE4.2 and AVX support?

Answer 1: Support for a more recent Intel SIMD ISA extension implies support for the previous ones. AVX2 definitely implies AVX1. I think AVX1 implies that all of the SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2 feature bits must also be set in CPUID. Even if this is not formally guaranteed, many things make this assumption, and a CPU that violated it would break them in practice.
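Even so, runtime detection can simply test each feature bit individually rather than relying on the implication. A minimal sketch using the GCC/Clang builtin __builtin_cpu_supports (the variable names are just for illustration):

    #include <cstdio>

    // Sketch (GCC/Clang only): query each CPUID feature bit directly
    // instead of assuming AVX2 implies the older extensions.
    int main() {
        bool avx2  = __builtin_cpu_supports("avx2");
        bool avx   = __builtin_cpu_supports("avx");
        bool sse42 = __builtin_cpu_supports("sse4.2");
        std::printf("avx2=%d avx=%d sse4.2=%d\n", avx2, avx, sse42);
        return 0;
    }

On real hardware the first flag being set should imply the other two, so the explicit checks cost little and guard against the (unlikely) exception.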

AVX2: Computing dot product of 512 float arrays

Submitted by ☆樱花仙子☆ on 2020-07-14 17:43:31
Question: I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512. I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these suggestions differ, and I am not sure which approach is fastest.
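No answer is preserved in this excerpt. For reference, a common AVX2 approach for long arrays (a sketch, not the thread's answer) avoids _mm256_dp_ps entirely and does vertical fused multiply-adds, reducing horizontally only once at the end. The function name dot512_avx2 is hypothetical; it assumes a length of 512 (a multiple of 16) and FMA support, so compile with -mavx2 -mfma:

    #include <immintrin.h>

    // Sketch: dot product of two 512-element float arrays with AVX2 + FMA.
    float dot512_avx2(const float* a, const float* b) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (int i = 0; i < 512; i += 16) {
            // Two independent accumulators hide the latency of vfmadd.
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                                   _mm256_loadu_ps(b + i + 8), acc1);
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);
        // Horizontal sum of the 8 lanes.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
        return _mm_cvtss_f32(s);
    }

Using more than one accumulator matters because each FMA depends on the previous value of its accumulator; with a single chain the loop is bound by FMA latency rather than throughput.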

Left-shift (of float32 array) with AVX2 and filling up with a zero

Submitted by 我是研究僧i on 2020-06-28 03:59:52
Question: I have been using the following "trick" in C code with SSE2 for single-precision floats for a while now:

    static inline __m128 SSEI_m128shift(__m128 data) {
        return (__m128)_mm_srli_si128(_mm_castps_si128(data), 4);
    }

For data like [1.0, 2.0, 3.0, 4.0], it results in [2.0, 3.0, 4.0, 0.0], i.e. it does a left shift by one position (in array order) and fills the vacated slot with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc at least). I am somehow failing to achieve the same for 256-bit registers with AVX2.
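No answer is preserved in this excerpt. One plausible AVX2 approach (a sketch, not the thread's answer): _mm256_srli_si256 shifts only within each 128-bit lane, so a lane-crossing shift can instead be built from a full-register rotate (_mm256_permutevar8x32_ps, AVX2) followed by a blend that zeroes the wrapped-around element. The function name avx2_m256shift is hypothetical:

    #include <immintrin.h>

    // Sketch: shift the 8 floats of a __m256 left by one element (array
    // order) and fill the last slot with zero. Requires AVX2 for vpermps.
    static inline __m256 avx2_m256shift(__m256 data) {
        // Rotate elements left by one across the whole 256-bit register.
        __m256 rot = _mm256_permutevar8x32_ps(
            data, _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0));
        // Replace element 7 (the wrapped-around value) with 0.0f.
        return _mm256_blend_ps(rot, _mm256_setzero_ps(), 0x80);
    }

Unlike the SSE2 version this is two instructions plus a constant load for the permute indices, which is typical: lane-crossing shuffles on AVX2 rarely collapse to a single immediate-operand instruction.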