avx

How to check if a CPU supports the SSE3 instruction set?

夙愿已清 posted on 2019-12-17 02:26:46
Question: Is the following code valid to check whether a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP (see http://msdn.microsoft.com/en-us/library/ms724482(v=vs.85).aspx).

    bool CheckSSE3()
    {
        int CPUInfo[4] = {-1};
        //-- Get number of valid info ids
        __cpuid(CPUInfo, 0);
        int nIds = CPUInfo[0];
        //-- Get info for id "1"
        if (nIds >= 1)
        {
            __cpuid(CPUInfo, 1);
            bool bSSE3NewInstructions = (CPUInfo[2] & 0x1) || false;
            return
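
For comparison, here is a minimal sketch of the same check for GCC or Clang, which ship a <cpuid.h> header with a __get_cpuid() helper (the question uses the MSVC __cpuid intrinsic instead). SSE3 is reported in bit 0 of ECX for CPUID leaf 1, which is the bit the question's code tests. The names sse3_bit_set and cpu_has_sse3 are made up for this illustration, and the non-x86 fallback simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>   /* GCC/Clang helper header; not available under MSVC */
#define HAVE_GNU_CPUID 1
#endif

/* Hypothetical helper: SSE3 is bit 0 of ECX as returned by CPUID leaf 1. */
static bool sse3_bit_set(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & 1u) != 0;
}

/* Hypothetical helper: true only if CPUID leaf 1 exists and reports SSE3.
 * On targets without the GNU cpuid helper this sketch just returns false. */
static bool cpu_has_sse3(void)
{
#ifdef HAVE_GNU_CPUID
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return sse3_bit_set(ecx);
#else
    return false;
#endif
}
```

Keeping the bit test in its own function makes it checkable without depending on the machine the code happens to run on.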

Why is the AVX-256 VMOVAPS Instruction only copying four single precision floats instead of 8?

这一生的挚爱 posted on 2019-12-13 19:07:45
Question: I am trying to familiarize myself with the 256-bit AVX instructions available on some of the newer Intel processors. I have already verified that my i7-4720HQ supports 256-bit AVX instructions. The problem I am having is that the VMOVAPS instruction, which should copy 8 single-precision floating-point values, is only copying 4.

    dot PROC
        VMOVAPS YMM1, ymmword ptr [RCX]
        VDPPS YMM2, YMM1, ymmword ptr [RDX], 255
        VMOVAPS ymmword ptr [RCX], YMM2
        MOVSS XMM0, DWORD PTR [RCX]
        RET
    dot ENDP

In case you

C style cast versus intrinsic cast

北城以北 posted on 2019-12-13 14:19:39
Question: Let's assume I have defined __m256d x and that I want to extract the lower 128 bits. I would do:

    __m128d xlow = _mm256_castpd256_pd128(x);

However, I recently saw someone do:

    __m128d xlow = (__m128d) x

Is there a preferred method to use for the cast? Why use the first method?
Source: https://stackoverflow.com/questions/20401413/c-style-cast-versus-intrinsic-cast

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

随声附和 posted on 2019-12-13 12:19:28
Question: Here is an example of a class which shows the supported instruction sets: https://msdn.microsoft.com/en-us/library/hskdteyh.aspx I want to write three different implementations of a single function, each of them using a different instruction set. But with the /ARCH:AVX2 flag, for example, this app won't ever run anywhere except on 4th-generation or later Intel processors, so the check is pointless. So the question is: what exactly does this flag do? Enables support or enables compiler
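
The three-implementation setup the question describes is usually handled by runtime dispatch: detect the CPU level once, then call through a function pointer. The sketch below shows only the selection logic. Every name in it (isa_level, sum_fn, select_sum, and the sum_* functions) is hypothetical, and the three implementations are plain scalar stand-ins. In a real MSVC build each variant would presumably live in its own source file compiled with its own /arch setting, because /arch lets the compiler emit those instructions anywhere in the code it compiles and does not add any runtime check.

```c
#include <stddef.h>

/* Hypothetical feature levels, ordered from least to most capable. */
typedef enum { ISA_BASELINE = 0, ISA_AVX = 1, ISA_AVX2 = 2 } isa_level;

typedef long (*sum_fn)(const int *data, size_t n);

/* Shared scalar body so the sketch stays portable. */
static long sum_scalar(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Stand-ins for the three variants; real ones would use intrinsics and be
 * built in separate translation units with different /arch options. */
static long sum_baseline(const int *d, size_t n) { return sum_scalar(d, n); }
static long sum_avx(const int *d, size_t n)      { return sum_scalar(d, n); }
static long sum_avx2(const int *d, size_t n)     { return sum_scalar(d, n); }

/* Pick the most capable variant the detected level allows. */
static sum_fn select_sum(isa_level detected)
{
    if (detected >= ISA_AVX2)
        return sum_avx2;
    if (detected >= ISA_AVX)
        return sum_avx;
    return sum_baseline;
}
```

The caller would obtain the detected level from a CPUID-based check such as the class linked in the question, call select_sum() once at startup, and reuse the returned pointer.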

How to create a 8 bit mask from lsb of __m64 value?

最后都变了- posted on 2019-12-13 08:54:24
Question: I have a use case where I have an array of bits, each represented as an 8-bit integer, for example uint8_t data[] = {0,1,0,1,0,1,0,1}; I want to create a single integer by extracting only the LSB of each value. I know that I can create a mask using the int _mm_movemask_pi8 (__m64 a) function, but this intrinsic takes only the MSB of each byte, not the LSB. Is there a similar intrinsic or an efficient method to extract the LSBs into a single 8-bit integer?
Answer 1: There is no direct way to do it, but obviously you can
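
One way to gather the LSBs without any SIMD intrinsic is a multiply-and-shift trick on a 64-bit integer, sketched below. This is a portable alternative, not necessarily what the truncated answer goes on to suggest; with MMX or SSE2, a common approach is to shift each element left by 7 so the LSB lands in the MSB position and then use the movemask intrinsic. The function name pack_lsbs is made up for this example, and bit i of the result corresponds to data[i].

```c
#include <stdint.h>

/* Hypothetical helper: packs the least significant bit of each of 8 bytes
 * into one byte, with data[i] supplying bit i of the result. */
static uint8_t pack_lsbs(const uint8_t data[8])
{
    uint64_t x = 0;

    /* Place data[i] in byte i of x, independent of host endianness. */
    for (int i = 0; i < 8; i++)
        x |= (uint64_t)data[i] << (8 * i);

    /* Keep only bit 0 of every byte. */
    x &= UINT64_C(0x0101010101010101);

    /* The multiplier moves the bit at position 8*i to position 56 + i.
     * All partial products land on distinct bit positions, so no carries
     * disturb the top byte. */
    return (uint8_t)((x * UINT64_C(0x0102040810204080)) >> 56);
}
```

For the array in the question, {0,1,0,1,0,1,0,1}, this yields 0xAA, because elements 1, 3, 5 and 7 have their LSB set.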

Performing AVX integer operation

拥有回忆 posted on 2019-12-13 06:04:39
Question: I'm trying to optimize some integer (_int64) operations using AVX. However, I can't even get a simple add operation to work. It keeps reporting an illegal instruction. Can someone point out what I'm doing wrong? Thanks.

    for (int i = 0; i < 1; i+=4)
    {
        __m256i rA, rB, rC;
        __m256i *iu, *ju, *ku;
        iu = (__m256i *)(MatrixAiB1 + i);
        ju = (__m256i *)(MatrixAjB1+ i);
        ku = (__m256i *) (store+ i);
        rA=_mm256_load_si256(iu);
        rB=_mm256_load_si256(ju);
        rC=_mm256_add_epi16(rA,rB);
        _mm256_store_si256(ku,rC);
    }

Answer 1: You

TensorFlow error using AVX instructions on Linux while working on Windows on the same machine

无人久伴 posted on 2019-12-13 03:55:52
Question: I'm using a dual-boot machine with Windows and Ubuntu. The same code runs fine under Windows but fails under Ubuntu with this error: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine. Under Windows the same code gives a similar warning but still runs. I couldn't find a solution to this problem online. System specifications:

Extracting ints and shorts from a struct using AVX?

╄→гoц情女王★ posted on 2019-12-12 14:22:45
Question: I have a struct which contains a union between various data members and an AVX type, so that all the bytes can be loaded in one load. My code looks like:

    #include <immintrin.h>
    union S{
        struct{
            int32_t a;
            int32_t b;
            int16_t c;
            int16_t d;
        };
        __m128i x;
    }

I'd like to use the AVX register to load the data all together and then separately extract the four members into int32_t and int16_t local variables. How would I go about doing this? I am unsure how I can separate the data members from each other when
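
A minimal sketch of one way to pull the four members out of a 128-bit register, using only SSE2 intrinsics (_mm_cvtsi128_si32, _mm_srli_si128, _mm_extract_epi16), follows. The type name fields and the function unpack_fields are invented for this example. It assumes the little-endian layout of the struct in the question, with a at byte 0, b at byte 4, c at byte 8 and d at byte 10, and it falls back to memcpy where SSE2 is not available so the sketch stays portable.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define USE_SSE2 1
#endif

/* Hypothetical mirror of the anonymous struct inside the question's union. */
typedef struct {
    int32_t a;
    int32_t b;
    int16_t c;
    int16_t d;
} fields;

/* raw16 must point to 16 readable bytes whose first 12 hold a, b, c, d. */
static fields unpack_fields(const void *raw16)
{
    fields f;
#ifdef USE_SSE2
    __m128i x = _mm_loadu_si128((const __m128i *)raw16);
    f.a = _mm_cvtsi128_si32(x);                    /* bits 0..31   */
    f.b = _mm_cvtsi128_si32(_mm_srli_si128(x, 4)); /* bits 32..63  */
    f.c = (int16_t)_mm_extract_epi16(x, 4);        /* 16-bit lane 4 */
    f.d = (int16_t)_mm_extract_epi16(x, 5);        /* 16-bit lane 5 */
#else
    memcpy(&f, raw16, sizeof f);                   /* portable fallback */
#endif
    return f;
}
```

_mm_extract_epi16 returns the 16-bit lane zero-extended into an int, so the cast back to int16_t is what restores negative values.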

Improving a recursive hadamard transformation

梦想与她 posted on 2019-12-12 14:13:32
Question: I have the following code to calculate a Hadamard transform. Right now, the hadamard function is the bottleneck of my program. Do you see any potential to speed it up? Maybe using AVX2 instructions? Typical input sizes are around 512 or 1024. Best, Tom

    #include <stdio.h>
    void hadamard(double *p, size_t len) {
        double tmp = 0.0;
        if(len == 2) {
            tmp = p[0];
            p[0] = tmp + p[1];
            p[1] = tmp - p[1];
        } else {
            hadamard(p, len/2);
            hadamard(p+len/2, len/2);
            for(int i = 0; i < len/2; i++) {
                tmp = p[i];
                p[i
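
A common first step before reaching for intrinsics is to remove the recursion. The sketch below is an iterative, in-place fast Walsh-Hadamard transform; it assumes len is a power of two and that the truncated loop in the question combines p[i] with p[i + len/2], in which case both versions compute the same unnormalized transform. The function name hadamard_iterative is invented here.

```c
#include <stddef.h>

/* In-place, unnormalized Walsh-Hadamard transform.
 * Assumes len is a power of two (len >= 1). */
static void hadamard_iterative(double *p, size_t len)
{
    /* h is the butterfly half-width: 1, 2, 4, ..., len/2. */
    for (size_t h = 1; h < len; h *= 2) {
        /* Each block of 2*h elements is processed independently. */
        for (size_t i = 0; i < len; i += 2 * h) {
            /* Combine element j with its partner h positions later. */
            for (size_t j = i; j < i + h; j++) {
                double x = p[j];
                double y = p[j + h];
                p[j] = x + y;
                p[j + h] = x - y;
            }
        }
    }
}
```

For h of 4 or more, the inner loop reads and writes two contiguous runs of doubles, which is the shape a compiler's auto-vectorizer or hand-written 256-bit code can work on four elements at a time. Whether that beats the recursive version at sizes of 512 or 1024 would need measuring.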

Why _mm256_load_pd compiled to MOVUPD instead of MOVAPD?

断了今生、忘了曾经 posted on 2019-12-12 11:19:30
Question: Why does the following code result in unaligned AVX instructions (MOVUPD instead of MOVAPD)? I compiled this with Visual Studio 2015. How can I tell the compiler that my data is indeed aligned?

    const size_t ALIGN_SIZE = 64;
    const size_t ARRAY_SIZE = 1024;
    double __declspec(align(ALIGN_SIZE)) a[ARRAY_SIZE];
    double __declspec(align(ALIGN_SIZE)) b[ARRAY_SIZE];
    //Calculate the dotproduct
    __m256d ymm0 = _mm256_set1_pd(0.0);
    for (int i = 0; i < ARRAY_SIZE; i += 8) {
        __m256d ymm1 = _mm256_load_pd(a + i);
        _