avx

Is it okay to mix legacy SSE encoded instructions and VEX encoded ones in the same code path?

北城以北 submitted on 2019-12-05 02:31:10
Along with the introduction of AVX, Intel introduced the VEX encoding scheme into the Intel 64 and IA-32 architecture. This encoding scheme is used mostly with AVX instructions. I was wondering if it's okay to intermix VEX-encoded instructions and the now-called "legacy SSE" instructions. The main reason I'm asking is code size. Consider these two instructions:

shufps xmm0, xmm0, 0
vshufps xmm0, xmm0, xmm0, 0

I commonly use the first one to "broadcast" a scalar value to all the places in an XMM register. Now, the instruction set says that the only difference between these two
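A quick illustration of my own (not from the question) of the usual guidance when the two encodings meet: if 256-bit VEX code may be followed by legacy-SSE-encoded code, clear the upper YMM state first to avoid the SSE/AVX transition penalty. The function name and surrounding code are only for illustration.

#include <immintrin.h>

/* Sketch only: do some 256-bit VEX-encoded work, then hand off to code that
   may contain legacy-SSE encodings (e.g. a library built without AVX). */
void avx_then_legacy_sse(float *dst, const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb)); /* VEX-encoded 256-bit ops */
    _mm256_zeroupper();  /* zero the upper halves of all YMM registers */
    /* legacy-SSE code can now run without a state-transition penalty */
}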

Fastest way to expand bits in a field to all (overlapping + adjacent) set bits in a mask?

前提是你 submitted on 2019-12-05 00:27:59
Say I have 2 binary inputs named IN and MASK. The actual field size could be 32 to 256 bits, depending on which instruction set is used to accomplish the task. Both inputs change every call.

Inputs:
IN   = ...1100010010010100...
MASK = ...0001111010111011...
Output:
OUT  = ...0001111010111000...

edit: another example result from some comment discussion

IN   = ...11111110011010110...
MASK = ...01011011001111110...
Output:
OUT  = ...01011011001111110...

I want to get the contiguous adjacent 1 bits of MASK that a 1 bit of IN is within. (Is there a general term for this kind of operation? Maybe I'm not
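For reference, a plain scalar sketch of my own of the operation being asked about (not an optimized answer): flood the seed bits IN & MASK outward until they fill their contiguous MASK groups. It is a fixed-point loop, so it is simple rather than fast.

#include <stdint.h>

/* Sketch: expand each set bit of in to the full contiguous group of mask
   bits that contains it. Iterates until no further growth; the worst case
   is the length of the longest group. */
uint64_t expand_to_mask_groups(uint64_t in, uint64_t mask)
{
    uint64_t out = in & mask;
    for (;;) {
        uint64_t grown = (out | (out << 1) | (out >> 1)) & mask;
        if (grown == out)
            return out;
        out = grown;
    }
}

Both examples above reproduce under this definition: every MASK group that contains at least one 1 bit of IN survives whole, the rest are cleared.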

Shift elements to the left of a SIMD register based on boolean mask

蹲街弑〆低调 submitted on 2019-12-04 20:57:40
This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector

I would like to create an optimal function with this signature:

__m256i PackLeft(__m256i inputVector, __m256i boolVector);

The desired behaviour is that, on an input of 64-bit ints like this:

inputVector = {42, 17, 13, 3}
boolVector  = {true, false, true, false}

it masks out all values that have false in boolVector and then repacks the values that remain to the left. For the input above, the return value should be:

{42, 13, X, X}

...where X is "I don't care". An obvious way to do this is to use _mm
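One workable AVX2 approach, sketched by me (not the asker's code or an accepted answer): turn boolVector into a 4-bit mask with a movemask, then look up a vpermd index pattern that left-packs the kept 64-bit lanes. It assumes each lane of boolVector is all-ones or all-zero.

#include <immintrin.h>
#include <stdint.h>

/* Sketch: left-pack the 64-bit lanes whose boolVector lane is all-ones. */
static __m256i PackLeft(__m256i inputVector, __m256i boolVector)
{
    /* 4-bit mask: bit i set if 64-bit lane i should be kept */
    int m = _mm256_movemask_pd(_mm256_castsi256_pd(boolVector));

    /* For each mask value, the 32-bit index pairs of the kept lanes,
       packed toward lane 0; trailing entries are "don't care". */
    static const uint32_t lut[16][8] = {
        {0,1,2,3,4,5,6,7}, /* 0000 */  {0,1,2,3,4,5,6,7}, /* 0001 */
        {2,3,0,1,4,5,6,7}, /* 0010 */  {0,1,2,3,4,5,6,7}, /* 0011 */
        {4,5,0,1,2,3,6,7}, /* 0100 */  {0,1,4,5,2,3,6,7}, /* 0101 */
        {2,3,4,5,0,1,6,7}, /* 0110 */  {0,1,2,3,4,5,6,7}, /* 0111 */
        {6,7,0,1,2,3,4,5}, /* 1000 */  {0,1,6,7,2,3,4,5}, /* 1001 */
        {2,3,6,7,0,1,4,5}, /* 1010 */  {0,1,2,3,6,7,4,5}, /* 1011 */
        {4,5,6,7,0,1,2,3}, /* 1100 */  {0,1,4,5,6,7,2,3}, /* 1101 */
        {2,3,4,5,6,7,0,1}, /* 1110 */  {0,1,2,3,4,5,6,7}, /* 1111 */
    };

    __m256i idx = _mm256_loadu_si256((const __m256i *)lut[m]);
    return _mm256_permutevar8x32_epi32(inputVector, idx); /* lane-crossing vpermd */
}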

SSE - AVX conversion from double to char

别说谁变了你拦得住时间么 submitted on 2019-12-04 18:34:19
I want to convert a vector of double precision values to char. I have to write two distinct implementations, one for SSE2 and the other for AVX2. I started with AVX2.

__m128i sub_proc(__m256d& in)
{
    __m256d _zero_pd = _mm256_setzero_pd();
    __m256d ih_pd = _mm256_unpackhi_pd(in, _zero_pd);
    __m256d il_pd = _mm256_unpacklo_pd(in, _zero_pd);
    __m128i ih_si = _mm256_cvtpd_epi32(ih_pd);
    __m128i il_si = _mm256_cvtpd_epi32(il_pd);
    ih_si = _mm_shuffle_epi32(ih_si, _MM_SHUFFLE(3,1,2,0));
    il_si = _mm_shuffle_epi32(il_si, _MM_SHUFFLE(3,1,2,0));
    ih_si = _mm_packs_epi32(_mm_unpacklo_epi32(il_si,ih_si), _mm_unpackhi
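For orientation, a minimal sketch of my own (not the asker's code) of the usual AVX + SSE2 route for four doubles: convert to 32-bit integers, then narrow with the saturating pack instructions down to 8-bit.

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Sketch: 4 doubles -> 4 saturated signed 8-bit values. */
static void pd_to_s8(__m256d in, int8_t out[4])
{
    __m128i i32 = _mm256_cvtpd_epi32(in);    /* 4 x int32, rounded            */
    __m128i i16 = _mm_packs_epi32(i32, i32); /* saturate to int16             */
    __m128i i8  = _mm_packs_epi16(i16, i16); /* saturate to int8              */
    int low = _mm_cvtsi128_si32(i8);         /* the 4 result bytes            */
    memcpy(out, &low, 4);
}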

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

橙三吉。 submitted on 2019-12-04 18:22:06
Here is an example of a class which shows supported instruction sets: https://msdn.microsoft.com/en-us/library/hskdteyh.aspx I want to write three different implementations of a single function, each of them using a different instruction set. But with the /arch:AVX2 flag, for example, this app won't ever run on anything but 4th-generation-and-later Intel processors, so the whole runtime check becomes pointless. So, the question is: what exactly does this flag do? Does it enable support, or does it enable compiler optimizations using the given instruction sets? In other words, can I completely remove this flag and keep using
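A common companion to per-instruction-set implementations is a runtime dispatcher. Here is a simplified sketch of my own using MSVC's __cpuidex; note that it omits the OSXSAVE/_xgetbv check that a production-quality test should also perform.

#include <intrin.h>

/* Sketch: query the AVX2 feature bit, so that only the translation unit
   compiled with /arch:AVX2 is executed on capable CPUs. */
static bool cpu_has_avx2()
{
    int info[4];
    __cpuidex(info, 7, 0);            /* structured extended feature leaf */
    return (info[1] & (1 << 5)) != 0; /* EBX bit 5 = AVX2                 */
}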

Aligned and unaligned memory access with AVX/AVX2 intrinsics

拥有回忆 submitted on 2019-12-04 17:35:04
Question: According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

vaddps ymm0, ymm0, YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

vmovaps ymm0, YMMWORD PTR [rax]

the load address has to be aligned (to a multiple of 32), otherwise an exception is raised. What confuses me is the automatic code generation
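To make the distinction concrete at the intrinsics level, a small illustration of my own (not from the manual): the unaligned-load intrinsic maps to vmovups and tolerates any address, while the aligned one maps to vmovaps and may fault.

#include <immintrin.h>

/* Sketch: two loads from the same buffer. The first never faults on
   alignment; the second requires a 32-byte-aligned address. */
__m256 load_both(const float *p)
{
    __m256 a = _mm256_loadu_ps(p); /* vmovups: any alignment is fine     */
    __m256 b = _mm256_load_ps(p);  /* vmovaps: p must be 32-byte aligned */
    return _mm256_add_ps(a, b);
}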

Using __m256d registers

青春壹個敷衍的年華 submitted on 2019-12-04 12:17:45
Question: How do you use __m256d? Say I want to use the Intel AVX intrinsic _mm256_add_pd on a simple Vector3 class with three 64-bit double precision components (x, y, and z). What is the correct way to use it? Since x, y and z are members of the Vector3 class, can I declare them in a union with an __m256d variable?

union Vector3
{
    struct { double x, y, z; };
    __m256d _register; // the Intel register?
};

Then can I go:

Vector3 add(const Vector3& o)
{
    Vector3 result;
    result._register = _mm256
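One layout that maps cleanly onto a __m256d, sketched by me (not the accepted answer, and relying on the common anonymous-struct extension): pad the vector to four doubles and align it to 32 bytes.

#include <immintrin.h>

/* Sketch: a padded, 32-byte-aligned Vector3 whose storage doubles as an
   AVX register image. The fourth lane w is unused padding. */
union alignas(32) Vector3
{
    struct { double x, y, z, w; };
    __m256d reg;

    Vector3 add(const Vector3& o) const
    {
        Vector3 r;
        r.reg = _mm256_add_pd(reg, o.reg);
        return r;
    }
};

Whether reg actually lives in a YMM register is up to the optimizer; the union only controls the memory layout.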

Are static / static local SSE / AVX variables blocking an xmm / ymm register?

帅比萌擦擦* submitted on 2019-12-04 12:01:23
Question: When using SSE intrinsics, zero vectors are often required. One way to avoid creating a zero variable inside a function every time it is called (each call effectively executing a vector xor instruction) would be to use a static local variable, as in

static inline __m128i negate(__m128i a)
{
    static __m128i zero = _mm_setzero_si128();
    return _mm_sub_epi16(zero, a);
}

It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a
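For comparison, a sketch of the non-static variant (mine): _mm_setzero_si128() compiles to a single dependency-breaking pxor, so materializing the zero inline avoids both the static-initialization guard and a load from memory.

#include <immintrin.h>

/* Sketch: same negate, but the zero vector is created in a register on
   each call instead of being read from a static. */
static inline __m128i negate_inline_zero(__m128i a)
{
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}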

AVX convert 64 bit integer to 64 bit float

女生的网名这么多〃 submitted on 2019-12-04 10:59:00
I would like to convert 4 packed 64-bit integers to 4 packed 64-bit floats using AVX. I've tried something like:

int64_t *ls = (int64_t *) _mm_malloc(256, 32);
ls[0] = a;
//...
ls[3] = d;
__m256i packed = _mm256_load_si256((__m256i const *)ls);

Which will display in the debugger:

(gdb) print packed
$4 = {1234, 5678, 9012, 3456}

Okay so far, but the only cast/conversion operation that I can find is _mm256_castsi256_pd, which doesn't get me what I want:

__m256d pd = _mm256_castsi256_pd(packed);
(gdb) print pd
$5 = {6.0967700696809824e-321, 2.8053047370865979e-320, 4.4525196003213139e-320, 1
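Without AVX-512 (which adds vcvtqq2pd / _mm256_cvtepi64_pd) there is no single packed int64-to-double conversion, so here is a straightforward sketch of my own that converts lane by lane; faster bit-manipulation tricks exist for limited value ranges.

#include <immintrin.h>
#include <stdint.h>

/* Sketch: convert 4 packed int64 lanes to 4 packed doubles via a spill. */
static __m256d cvt_epi64_pd(__m256i v)
{
    int64_t tmp[4];
    _mm256_storeu_si256((__m256i *)tmp, v);
    return _mm256_set_pd((double)tmp[3], (double)tmp[2],
                         (double)tmp[1], (double)tmp[0]);
}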

How to speed up calculation of integral image?

和自甴很熟 submitted on 2019-12-04 09:59:38
I often need to calculate an integral image. This is the simple algorithm:

void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride)
{
    memset(sum, 0, (width + 1) * sizeof(uint32_t));
    sum += sum_stride + 1;
    for (size_t row = 0; row < height; row++)
    {
        uint32_t row_sum = 0;
        sum[-1] = 0;
        for (size_t col = 0; col < width; col++)
        {
            row_sum += src[col];
            sum[col] = row_sum + sum[col - sum_stride];
        }
        src += src_stride;
        sum += sum_stride;
    }
}

And I have a question: can I speed up this algorithm (for example, by using SSE or AVX)?
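As one starting point for vectorizing the inner loop, here is a small SSE2 building block of my own (not from the question): an in-register inclusive prefix sum of four 32-bit lanes. A per-row loop would widen the uint8_t pixels, run this on each group of four, then add the running carry from the previous group and the values from the row above.

#include <emmintrin.h>

/* Sketch: inclusive prefix sum of four uint32 lanes.
   After the two steps, lane i holds x[0] + ... + x[i]. */
static inline __m128i prefix_sum_epi32(__m128i x)
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4)); /* add neighbour 1 lane back  */
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8)); /* add neighbour 2 lanes back */
    return x;
}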