x86-64

Storing individual doubles from a packed double vector using Intel AVX

Asked by 空扰寡人 on 2019-12-07 05:59:28
Question: I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d), what is the most efficient way (i.e. the fewest operations) to store each of its elements to a different place in memory (i.e. to fan them out to different locations so that they are no longer packed)? Pseudocode:

    __m256d *src;
    double *dst;
    int dst_dist;
    dst[0]            = src[0];
    dst[dst_dist]     = src[1];
    dst[2 * dst_dist] = src[2];
    dst[3 * dst_dist] = src[3];

Using SSE, I …
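A minimal sketch of one common approach, not taken from the question itself: split the vector into its two 128-bit halves, then use the SSE2 low/high store intrinsics to write each double individually. The helper name scatter_pd is made up for illustration; dst_dist is the element stride from the pseudocode above.

    #include <immintrin.h>

    /* Hypothetical helper: fan the four doubles of src out to
       dst[0], dst[dst_dist], dst[2*dst_dist], dst[3*dst_dist]. */
    static void scatter_pd(__m256d src, double *dst, int dst_dist)
    {
        __m128d lo = _mm256_castpd256_pd128(src);   /* elements 0, 1 */
        __m128d hi = _mm256_extractf128_pd(src, 1); /* elements 2, 3 */
        _mm_storel_pd(dst,                lo);      /* element 0 */
        _mm_storeh_pd(dst + dst_dist,     lo);      /* element 1 */
        _mm_storel_pd(dst + 2 * dst_dist, hi);      /* element 2 */
        _mm_storeh_pd(dst + 3 * dst_dist, hi);      /* element 3 */
    }

This costs one extract plus four scalar-sized stores; whether it beats storing the whole vector to a temporary array and copying scalars depends on the surrounding code.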

Are word-aligned loads faster than unaligned loads on x64 processors?

Asked by 允我心安 on 2019-12-07 04:11:44
Question: Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/x86-64 (Intel/AMD 64-bit) processors? A colleague of mine argues that unaligned loads are slow and should be avoided, citing the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example:

    struct A {
        char a;
        uint64_t b;
    };

struct A usually has a size of 16 bytes. On the other hand, the documentation of the Snappy compressor states that Snappy assumes …
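As an aside, a small sketch (my own, not from the question) of how unaligned access is usually expressed portably: a memcpy into a local, which compilers lower to a single unaligned load instruction on x86-64. On recent Intel and AMD cores such a load costs the same as an aligned one unless it crosses a cache-line boundary.

    #include <stdint.h>
    #include <string.h>

    /* Read a 64-bit value from an arbitrary, possibly unaligned address.
       The memcpy is defined behavior in C and compiles to a single mov. */
    static uint64_t load_u64(const void *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }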

GCC code that seems to break inline assembly rules but an expert believes otherwise

Asked by 孤者浪人 on 2019-12-07 04:03:55
Question: I was engaged with an expert who allegedly has vastly superior coding skills to mine and who understands inline assembly far better than I ever could. One of his claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or specify that the register may have been modified by the inline assembly. The conversation came about when someone else was trying to get assistance on a memset implementation that was effectively coded this way:

    void …
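For context, a sketch (my own, under the usual GNU C inline-asm rules) of what a contract-respecting rep stosb memset looks like: registers the asm modifies are declared as read-write outputs rather than plain inputs, and a "memory" clobber covers the stores. This is exactly the point in dispute, since rep stosb changes rdi and rcx.

    #include <stddef.h>

    static void *my_memset(void *dst, int c, size_t n)
    {
        void *d = dst;
        __asm__ volatile ("rep stosb"
                          : "+D" (d), "+c" (n)  /* rdi and rcx are modified  */
                          : "a" (c)             /* al supplies the fill byte */
                          : "memory");          /* the stores touch memory   */
        return dst;
    }

Listing d and n only as inputs would let the compiler assume rdi and rcx still hold their old values after the asm runs, which is the bug the question is about.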

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Asked by £可爱£侵袭症+ on 2019-12-07 03:33:53
Question: Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit components of an AVX register into an SSE register? A solution using AVX2 is OK. So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

    // Global constant
    const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    // ...
    // function code
    __m256i x = /* computed here */;
    const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x), …
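One alternative often suggested for Zen 1, where lane-crossing shuffles such as vpermps/vpermd are relatively expensive but in-lane shuffles are cheap, is to extract the high half and combine the two halves with shufps. A sketch (float version to match the title; my own code, not from the question):

    #include <immintrin.h>

    /* Gather elements 1, 3, 5, 7 of v into a __m128. */
    static __m128 odd_elements(__m256 v)
    {
        __m128 lo = _mm256_castps256_ps128(v);      /* elements 0..3 */
        __m128 hi = _mm256_extractf128_ps(v, 1);    /* elements 4..7 */
        return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
    }

Whether this beats the single vpermps depends on the microarchitecture and on what else competes for the shuffle ports.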

gcc argument register spilling on x86-64

Asked by こ雲淡風輕ζ on 2019-12-07 03:22:36
Question: I'm doing some experimenting with x86-64 assembly. Having compiled this dummy function:

    long myfunc(long a, long b, long c, long d,
                long e, long f, long g, long h)
    {
        long xx = a * b * c * d * e * f * g * h;
        long yy = a + b + c + d + e + f + g + h;
        long zz = utilfunc(xx, yy, xx % yy);
        return zz + 20;
    }

with gcc -O0 -g, I was surprised to find the following at the beginning of the function's assembly:

    0000000000400520 <myfunc>:
      400520: 55            push rbp
      400521: 48 89 e5      mov  rbp,rsp
      400524: 48 83 ec 50 …
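A compilable sketch of the calling convention at play (my own, standard x86-64 System V ABI facts rather than anything from the question):

    /* The first six integer arguments arrive in registers, the rest on
       the stack. At -O0, gcc spills the register arguments to stack
       slots in the prologue so a debugger can inspect them by name;
       -O1 and above eliminate those spills. */
    long where_args_live(long a, long b, long c, long d,
                         long e, long f, long g, long h)
    {
        /* a:rdi  b:rsi  c:rdx  d:rcx  e:r8  f:r9  g,h: caller's stack */
        return a + b + c + d + e + f + g + h;
    }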

xorl %eax, %eax in x86_64 assembly code produced by gcc

Asked by Deadly on 2019-12-07 02:19:45
Question: I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:

    void multA(double *x, long size)
    {
        long i;
        for (i = 0; i < size; ++i) {
            x[i] = 2.4 * x[i];
        }
    }

I compiled it with:

    gcc -S -m64 -O2 fun.c

and I get this:

        .file   "fun.c"
        .text
        .p2align 4,,15
        .globl  multA
        .type   multA, @function
    multA:
    .LFB34:
        .cfi_startproc
        testq   %rsi, %rsi
        jle     .L1
        movsd   .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
    .L3:
        movsd   (%rdi,%rax,8), %xmm0
        mulsd …
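The xorl %eax, %eax is the standard x86-64 zeroing idiom; here it initializes the loop index i, which lives in rax. A small sketch (my own) of the rule that makes a 32-bit xor sufficient:

    #include <stdint.h>

    /* In x86-64, any write to a 32-bit register zero-extends into the
       full 64-bit register, so "xorl %eax, %eax" clears all of rax with
       a shorter encoding than "xorq %rax, %rax" would need. */
    uint64_t zero_extend_demo(uint32_t lo)
    {
        return lo;   /* upper 32 bits of the result register are zero */
    }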

On a 64 bit machine, can I safely operate on individual bytes of a 64 bit quadword in parallel?

Asked by 喜欢而已 on 2019-12-06 20:01:35
Question: Background: I am doing parallel operations on rows and columns in images. My images have 8-bit or 16-bit pixels and I'm on a 64-bit machine. When I do operations on columns in parallel, two adjacent columns may share the same 32-bit int or 64-bit long. Basically, I want to know whether I can safely operate on individual bytes of the same quadword in parallel. Minimal test: I wrote a minimal test function that I have not been able to make fail. For each byte in a 64-bit long, I concurrently …
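A minimal sketch (my own, assuming C11 <threads.h> is available) of the guarantee being asked about: C11 and C++11 define every byte as a separate memory location, so threads that write different bytes of the same quadword do not form a data race. x86-64 hardware honours this; the only penalty is false sharing, since all eight bytes live in one cache line.

    #include <stdint.h>
    #include <threads.h>        /* pthreads would work identically */

    static uint8_t quad[8];     /* eight bytes sharing one quadword */

    static int bump_byte(void *arg)
    {
        size_t i = (size_t)arg;
        quad[i] += 1;           /* touches only byte i: race-free */
        return 0;
    }

    /* usage: for (i = 0; i < 8; ++i)
                  thrd_create(&t[i], bump_byte, (void *)i); */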

I don't understand why the compiler is giving me an error with this code

Asked by 天大地大妈咪最大 on 2019-12-06 19:53:39
Question: I have the following C code, which looks very correct to me. However, the clang compiler (in fact gcc or any other C compiler too) thinks otherwise.

    typedef struct {
        struct timeval td_start;
        struct timeval td_end;
    } Timer;

    void startTimer( struct Timer* ptimer ) {
        gettimeofday( &(ptimer->td_start), NULL );
    }

    void stopTimer( struct Timer* ptimer ) {
        gettimeofday( &(ptimer->td_end), NULL );
    }

The compiler gives the following warning and error messages. Any idea what is wrong here?

    ./timing.h:14:25: …
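The cause: the typedef declares an untagged struct, so struct Timer in the parameter lists names a completely different, never-defined struct type. A sketch of the fix, giving the struct a tag (using the typedef name Timer in the prototypes works equally well):

    #include <sys/time.h>

    typedef struct Timer {          /* tag now matches "struct Timer" */
        struct timeval td_start;
        struct timeval td_end;
    } Timer;

    void startTimer( struct Timer* ptimer ) {
        gettimeofday( &(ptimer->td_start), NULL );
    }

    void stopTimer( struct Timer* ptimer ) {
        gettimeofday( &(ptimer->td_end), NULL );
    }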

Can branch prediction cause illegal instruction?

Asked by 寵の児 on 2019-12-06 19:22:45
Question: In the following pseudo-code:

    if (rdtscp supported by hardware) {
        Invoke "rdtscp" instruction
    } else {
        Invoke "rdtsc" instruction
    }

Let's say the CPU does not support the rdtscp instruction, so we fall back to the else branch. If the CPU mispredicts the branch, is it possible for the instruction pipeline to try to execute rdtscp and throw an Illegal Instruction error?

Answer 1: It is explicitly documented for the #UD trap (Invalid Opcode Exception) in the Intel Processor Manuals, Volume 3A, …
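As context for the pseudo-code, a sketch (my own, using GCC/Clang's <cpuid.h> helper) of the runtime feature check: RDTSCP support is reported in CPUID leaf 0x80000001, EDX bit 27. Faults raised on a mispredicted path are discarded before the instruction retires, so the check is about correctness on CPUs that lack the instruction, not about keeping rdtscp out of the pipeline.

    #include <cpuid.h>     /* GCC/Clang helper header */

    static int has_rdtscp(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            return 0;
        return (edx >> 27) & 1;   /* EDX bit 27 = RDTSCP */
    }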

VirtualBox - Kernel requires an x86-64 cpu but only detected an i686 cpu

Asked by 烈酒焚心 on 2019-12-06 17:41:45
Question: I have an Intel i5-2410M CPU running at 2.30 GHz on a Windows 7 64-bit operating system, with VirtualBox 4.13 installed. I am trying to run ubuntu-14.04-desktop-amd64.iso, but I get an error:

    this kernel requires an x86-64 cpu but only detected an i686 cpu

I even enabled Intel Virtualization in the BIOS settings and then tried to use the image again, but I still get the same error. Is there any other reason why I can't use the image?

Answer 1: My best guess is that you somehow configured the VM for …