avx

AVX inside a VirtualBox VM?

独自空忆成欢 submitted on 2019-11-29 14:39:31
I installed the latest Ubuntu 14.04 amd64 (gcc 4.8.2) in VirtualBox and ran cat /proc/cpuinfo. The processor, a Core i5-2520M, does support AVX instructions. I previously used Ubuntu 12.04 amd64 (gcc 4.6), and it reported AVX support via /proc/cpuinfo. How can I use AVX in my software inside VirtualBox? VirtualBox 5.0 beta 3 now supports AVX and AVX2 (which I can confirm from testing). Source: https://stackoverflow.com/questions/24543874/avx-inside-a-virtualbox-vm
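
If the guest now exposes AVX, a quick way to confirm it before relying on AVX code paths is a runtime check. Below is a minimal sketch using the GCC/Clang builtin (file name and messages are illustrative, not from the question):

    // check_avx.cpp -- build with: g++ -O2 check_avx.cpp
    #include <cstdio>

    int main()
    {
        // __builtin_cpu_supports is available in GCC 4.8+ and Clang
        if (__builtin_cpu_supports("avx"))
            std::puts("AVX is exposed to this guest");
        else
            std::puts("AVX is NOT exposed to this guest");
        return 0;
    }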

Shuffling by mask with Intel AVX

时光总嘲笑我的痴心妄想 submitted on 2019-11-29 14:32:20
Question: I'm new to AVX programming. I have a register whose contents need to be shuffled. I want to shuffle several bytes from a 256-bit register, R1, into an empty register, R2. I want to define a mask that tells the shuffle operation which byte of the old register (R1) should be copied to which position in the new register. The mask should look like this (Src: byte position in R1, Target: byte position in R2): {(0,0),(1,1),(1,4),(2,5),...} This means several bytes are copied twice. I'm not 100% sure which function I should
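
A minimal sketch of how such a mask could be expressed with the AVX2 byte shuffle _mm256_shuffle_epi8 (VPSHUFB), assuming AVX2 is acceptable; the function name is invented, and note that this instruction shuffles within each 128-bit lane independently, so an index can only pick bytes from the destination byte's own lane:

    #include <immintrin.h>

    // Hypothetical example: dest byte 0 <- src 0, 1 <- 1, 4 <- 1, 5 <- 2,
    // all other destination bytes zeroed (a control byte with its high bit
    // set, e.g. -1, writes zero).
    static inline __m256i shuffle_bytes(__m256i r1)
    {
        const __m256i mask = _mm256_setr_epi8(
             0,  1, -1, -1,  1,  2, -1, -1,   // lower lane, first 8 bytes
            -1, -1, -1, -1, -1, -1, -1, -1,   // lower lane, last 8 bytes
            -1, -1, -1, -1, -1, -1, -1, -1,   // upper lane (indices are lane-relative)
            -1, -1, -1, -1, -1, -1, -1, -1);
        return _mm256_shuffle_epi8(r1, mask); // R2 = shuffled bytes of R1
    }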

Intel C++ compiler, ICC, seems to ignore SSE/AVX settings

我只是一个虾纸丫 submitted on 2019-11-29 11:25:31
I have recently downloaded and installed the Intel C++ compiler, Composer XE 2013, for Linux, which is free to use for non-commercial development: http://software.intel.com/en-us/non-commercial-software-development I'm running on an Ivy Bridge system (which has AVX). I have two versions of a function which do the same thing. One does not use SSE/AVX. The other version uses AVX. In GCC the AVX code is about four times faster than the scalar code. However, with the Intel C++ compiler the performance is much worse. With GCC I compile like this: gcc m6.cpp -o m6_gcc -O3 -mavx -fopenmp -Wall -pedantic
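
For reference, a pair of functions of the kind being benchmarked might look like the sketch below (illustrative only, not the m6.cpp from the question). With ICC the rough equivalent of the GCC command would be something like icpc m6.cpp -o m6_icc -O3 -xAVX -fopenmp, though the exact flag spelling depends on the compiler version.

    #include <immintrin.h>

    // Scalar reference version
    void add_scalar(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // Hand-written AVX version (assumes n is a multiple of 8 for brevity)
    void add_avx(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
    }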

Shifting 4 integers right by different values SIMD

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-29 11:02:48
SSE does not provide a way of shifting packed integers by a variable amount (I can use any instructions, AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this:

    i[0] = i[0] & 0b111111;
    i[1] = (i[1] >> 6) & 0b111111;
    i[2] = (i[2] >> 12) & 0b111111;
    i[3] = (i[3] >> 18) & 0b111111;

Essentially I'm trying to isolate a different group of 6 bits in each integer. So what is the optimal solution? Things I thought about: you can simulate a variable right shift with a variable left shift and a uniform right shift. I thought about multiplying the
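
One way to realize that "variable left shift + uniform right shift" idea without AVX2 is to do the variable left shift with an SSE4.1 packed multiply by powers of two. A sketch (the helper name is invented):

    #include <immintrin.h>

    // Isolate a different 6-bit field from each 32-bit lane:
    // lane k is multiplied by 2^(18-6k), which left-shifts its field up to
    // bits 18..23, then a uniform right shift and a mask extract it.
    static inline __m128i extract_6bit_fields(__m128i v)
    {
        const __m128i mul  = _mm_setr_epi32(1 << 18, 1 << 12, 1 << 6, 1);
        const __m128i mask = _mm_set1_epi32(0x3F);
        __m128i shifted = _mm_mullo_epi32(v, mul);   // variable left shift (SSE4.1)
        shifted = _mm_srli_epi32(shifted, 18);       // uniform logical right shift
        return _mm_and_si128(shifted, mask);         // keep the low 6 bits
    }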

Is _mm_broadcast_ss faster than _mm_set1_ps?

♀尐吖头ヾ submitted on 2019-11-29 10:40:28
Is this code: float a = ...; __m128 b = _mm_broadcast_ss(&a); always faster than this code: float a = ...; __m128 b = _mm_set1_ps(a); ? What if a is defined as static const float a = ... rather than float a = ...? _mm_broadcast_ss is likely to be faster than _mm_set1_ps. The former translates into a single instruction (VBROADCASTSS), while the latter is emulated using multiple instructions (probably a MOVSS followed by a shuffle). However, _mm_broadcast_ss requires the AVX instruction set, while only SSE is required for _mm_set1_ps. _mm_broadcast_ss has weaknesses imposed by the architecture which are largely
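
For comparison, the two forms side by side in a minimal sketch (function names are illustrative), with the AVX requirement of the broadcast noted:

    #include <immintrin.h>

    // Requires AVX: compiles to a single VBROADCASTSS load-and-broadcast.
    __m128 load_broadcast(const float *p) { return _mm_broadcast_ss(p); }

    // SSE only: typically a MOVSS plus a shuffle, although the compiler may
    // fold it into a broadcast when AVX code generation is enabled.
    __m128 load_set1(float x) { return _mm_set1_ps(x); }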

Wrapper for `__m256` Producing Segmentation Fault with Constructor - Windows 64 + MinGW + AVX Issues

走远了吗. submitted on 2019-11-29 10:37:11
I have a union that looks like this:

    union bareVec8f {
        __m256 m256; // avx 8x float vector
        float floats[8];
        int ints[8];
        inline bareVec8f() {}
        inline bareVec8f(__m256 vec) { this->m256 = vec; }
        inline bareVec8f &operator=(__m256 m256) { this->m256 = m256; return *this; }
        inline operator __m256 &() { return m256; }
    };

The __m256 needs to be aligned on a 32-byte boundary to be used with SSE/AVX functions, and it should be automatically, even within the union. And when I do this:

    bareVec8f test = _mm256_set1_ps(1.0f);

I get a segmentation fault. This code should work because of the constructor I made. However,
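
A workaround often suggested for this class of problem (MinGW on Win64 maintains only 16-byte stack alignment, so a __m256 passed or spilled by value can end up misaligned) is to take the vector by const reference instead. A sketch, assuming that is indeed the cause here:

    #include <immintrin.h>

    union bareVec8f {
        __m256 m256;
        float floats[8];
        int ints[8];

        bareVec8f() {}
        // Pass by const reference: no 32-byte argument has to be placed on a
        // stack that MinGW only guarantees to align to 16 bytes.
        bareVec8f(const __m256 &vec) { m256 = vec; }
        bareVec8f &operator=(const __m256 &vec) { m256 = vec; return *this; }
        operator __m256 &() { return m256; }
    };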

Using ymm registers as a “memory-like” storage location

China☆狼群 submitted on 2019-11-29 10:22:57
Consider the following loop in x86:

    ; on entry, rdi has the number of iterations
    .top:
    ; some magic happens here to calculate a result in rax
    mov [array + rdi * 8], rax  ; store result in output array
    dec rdi
    jnz .top

It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order since we index with rdi. I would like to transform the above loop to not make any writes to memory (we can assume the non-shown calculation doesn't write to memory). As long as the loop count in rdi is limited, I could use the ample space (512 bytes)
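
In C++ intrinsics terms (rather than the asm above), the idea of treating a ymm register as a small store could look like the sketch below: pack two 64-bit results into an xmm register and merge it into one half of a ymm "slot". The helper names are invented, and the integer form of VINSERTI128 used here needs AVX2:

    #include <immintrin.h>
    #include <cstdint>

    // Store two 64-bit results into the low 128-bit half of a ymm "slot"
    // without touching memory.
    static inline __m256i stash_pair_low(__m256i slot, uint64_t r0, uint64_t r1)
    {
        __m128i pair = _mm_set_epi64x((long long)r1, (long long)r0); // r0 in the low lane
        return _mm256_inserti128_si256(slot, pair, 0);               // overwrite bits 0..127
    }

    // Same idea for the high 128-bit half.
    static inline __m256i stash_pair_high(__m256i slot, uint64_t r2, uint64_t r3)
    {
        __m128i pair = _mm_set_epi64x((long long)r3, (long long)r2);
        return _mm256_inserti128_si256(slot, pair, 1);               // overwrite bits 128..255
    }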

Why do SSE instructions preserve the upper 128-bit of the YMM registers?

馋奶兔 submitted on 2019-11-29 09:25:36
It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when mixing AVX-256 instructions with SSE instructions. According to Intel's documentation, this is caused by SSE instructions being defined to preserve the upper 128 bits of the YMM registers; so, in order to be able to save power by not using the upper 128 bits of the AVX datapaths, the CPU stores those bits away when executing SSE code and reloads them when entering AVX code, the stores and loads being expensive. However, I can find no obvious reason or explanation why
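
Whatever the architectural reason, the practical mitigation is well defined: execute VZEROUPPER when leaving AVX code so the upper halves are marked clean before any legacy SSE code runs. A minimal sketch (the function and buffer names are illustrative; compilers building with -mavx typically insert this automatically):

    #include <immintrin.h>

    void copy8_avx(float *dst, const float *src)
    {
        __m256 v = _mm256_loadu_ps(src);   // 256-bit AVX work
        _mm256_storeu_ps(dst, v);
        _mm256_zeroupper();                // VZEROUPPER: mark upper YMM halves clean
        // legacy SSE code executed after this point avoids the save/restore penalty
    }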

How to find the horizontal maximum in a 256-bit AVX vector

只愿长相守 submitted on 2019-11-29 09:19:26
I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain: at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement. So the ideal solution will: 1) use only AVX instructions; 2) minimize
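
One common pattern (a sketch, not necessarily the optimal or "AVX-only" ideal the question asks for): swap the 128-bit halves and take a max, then max against a within-lane swap, then read the low element. Only the final scalar extract is nominally SSE, and the cast to __m128d costs no instruction:

    #include <immintrin.h>

    static inline double hmax_pd(__m256d v)
    {
        __m256d swap = _mm256_permute2f128_pd(v, v, 0x01);  // swap the two 128-bit halves
        __m256d m1   = _mm256_max_pd(v, swap);              // max across halves
        __m256d m2   = _mm256_max_pd(m1, _mm256_permute_pd(m1, 0x5)); // max within halves
        return _mm_cvtsd_f64(_mm256_castpd256_pd128(m2));   // low element holds the result
    }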

Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

雨燕双飞 submitted on 2019-11-29 08:14:00
In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

    // a is a 64-byte aligned array of double
    __m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
    __m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
    __m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
    __m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although at this point I'm not limited by L1 latency
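
One alternative, sketched below with AVX-only instructions (so it would run on the Sandy Bridge machine in question): do a single aligned 256-bit load and generate the four broadcasts with in-register shuffles. Whether this actually wins depends on whether the surrounding loop is bound by loads or by shuffle throughput; the helper name is invented.

    #include <immintrin.h>

    // Broadcast four contiguous doubles into four YMM registers from one load.
    static inline void broadcast4(const double *p, __m256d &b0, __m256d &b1,
                                  __m256d &b2, __m256d &b3)
    {
        __m256d a  = _mm256_load_pd(p);                    // p must be 32-byte aligned
        __m256d lo = _mm256_permute2f128_pd(a, a, 0x00);   // {p0, p1, p0, p1}
        __m256d hi = _mm256_permute2f128_pd(a, a, 0x11);   // {p2, p3, p2, p3}
        b0 = _mm256_permute_pd(lo, 0x0);                   // {p0, p0, p0, p0}
        b1 = _mm256_permute_pd(lo, 0xF);                   // {p1, p1, p1, p1}
        b2 = _mm256_permute_pd(hi, 0x0);                   // {p2, p2, p2, p2}
        b3 = _mm256_permute_pd(hi, 0xF);                   // {p3, p3, p3, p3}
    }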