x86-64 | 易学教程

How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)

The idea is that I'd like to collect returned double values into a vector register so they can be processed one machine width at a time, without storing them back to memory first. The processing is a vfma whose other two operands are all constexpr, so they can simply be produced by _mm256_setr_pd or by an aligned/unaligned load from a constexpr array. Is there a way to store a double into %ymm at a particular position directly from the value in %rax, for collecting purposes? The target machine is Kaby Lake. More efficient future vector instructions are also welcome. Inline-assembly is

Why does this MOVSS instruction use RIP-relative addressing? [duplicate]

This question already has an answer here: Why is the address of static variables relative to the Instruction Pointer? (1 answer) I found the following assembly code in a disassembler (floating-point logic, C++): 842: movss 0x21a(%rip),%xmm0 I understand that whenever this instruction is processed, rip will always be 842, so 0x21a(%rip) will be constant. It seems a little odd to use this register. Is there any advantage to using a RIP-relative address instead of other addressing modes? RIP is the instruction pointer register, which means that it contains the address of the instruction immediately following the current
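As a rough illustration of the address arithmetic behind 0x21a(%rip), the sketch below adds the signed 32-bit displacement to the address of the next instruction. The helper name rip_relative_target is hypothetical, and the 8-byte instruction length is an assumption based on the usual encoding of this movss form (F3 0F 10 05 plus a 4-byte displacement); the excerpt itself does not show the encoding.

```c
#include <stdint.h>

/* Hypothetical helper: effective address of a RIP-relative operand.
 * RIP-relative addressing uses the address of the NEXT instruction,
 * i.e. the current instruction's address plus its length, and adds a
 * sign-extended 32-bit displacement. */
uint64_t rip_relative_target(uint64_t insn_addr, uint64_t insn_len, int32_t disp)
{
    uint64_t next_insn = insn_addr + insn_len;
    return next_insn + (uint64_t)(int64_t)disp;
}
```

Under that assumed length, rip_relative_target(0x842, 8, 0x21a) gives 0xa64. Because only the distance between code and data is encoded, the result stays correct wherever the image is loaded.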

Push and Pop on AMD64 [duplicate]

This question already has an answer here: Does each PUSH instruction push a multiple of 8 bytes on x64? (2 answers) What is the equivalent of pushl %ecx and popl %ecx on an AMD64 system? My result is: Error: invalid instruction suffix for 'push'. I had a look around, and someone suggested changing ecx to rcx, but that just resulted in: Incorrect register '%rcx' used with 'l' suffix. Thanks for your help. On AMD64, push and pop operations are implicitly 64 bits wide and have no 32-bit counterparts. Try: pushq %rcx popq %rcx See here for details. Source: https://stackoverflow.com/questions/5050186/push-and-pop-on

Why is there no “sub rsp” instruction in this function prologue and why are function parameters stored at negative rbp offsets?

This is what I understood from reading some memory segmentation documents: when a function is called, a few instructions (called the function prologue) save the frame pointer on the stack, copy the value of the stack pointer into the base pointer, and reserve some memory for local variables. Here's a trivial piece of code I am trying to debug using GDB: void test_function(int a, int b, int c, int d) { int flag; char buffer[10]; flag = 31337; buffer[0] = 'A'; } int main() { test_function(1, 2, 3, 4); } The purpose of debugging this code was to understand what happens on the stack when a function is

Compiler using local variables without adjusting RSP

In the question "Compilers: Understanding assembly code generated from small programs", the compiler uses two local variables without adjusting the stack pointer. Not adjusting RSP for the use of local variables does not seem interrupt-safe, so the compiler seems to rely on the hardware automatically switching to a system stack when interrupts occur. Otherwise, the first interrupt that came along would push the instruction pointer onto the stack and overwrite the local variable. The code from that question is: #include <stdio.h> int main() { for(int i=0;i<10;i++){ int k=0; } } The assembly code

Why does compiler generate additional sqrts in the compiled assembly code

I'm trying to profile the time it takes to compute a sqrt using the following simple C code, where readTSC() is a function that reads the CPU's cycle counter. double sum = 0.0; int i; tm = readTSC(); for ( i = 0; i < n; i++ ) sum += sqrt((double) i); tm = readTSC() - tm; printf("%lld clocks in total\n",tm); printf("%15.6e\n",sum); However, when I printed out the assembly code using gcc -S timing.c -o timing.s on an Intel machine, the result (shown below) was surprising. Why are there two sqrts in the assembly code, one using the sqrtsd instruction and the other using a function call? Is it

Why does the compiler reserve a little stack space but not the whole array size?

The following code: int main() { int arr[120]; return arr[0]; } compiles into this: sub rsp, 360 mov eax, DWORD PTR [rsp-480] add rsp, 360 ret Knowing that ints are 4 bytes and the array has 120 elements, the array should take 480 bytes, but only 360 bytes are subtracted from RSP. Why is this? Below the stack area used by a function, there is a 128-byte red zone that is reserved for program use. Since main calls no other function, it has no need to move the stack pointer by more than it needs, though it doesn't matter in this case. It only subtracts enough from rsp to ensure that the array is
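The size arithmetic in this excerpt can be checked with a small sketch. The helper names and the RED_ZONE_BYTES constant are hypothetical; the 128-byte figure is the one quoted in the excerpt. The sketch only models sizes, not where the compiler actually places the array.

```c
#include <stdbool.h>
#include <stddef.h>

#define RED_ZONE_BYTES 128 /* red zone size quoted in the excerpt */

/* Bytes of local storage that end up below the adjusted stack pointer. */
size_t bytes_below_rsp(size_t locals_bytes, size_t rsp_adjust)
{
    return locals_bytes > rsp_adjust ? locals_bytes - rsp_adjust : 0;
}

/* True when whatever spills below rsp still fits inside the red zone. */
bool fits_with_red_zone(size_t locals_bytes, size_t rsp_adjust)
{
    return bytes_below_rsp(locals_bytes, rsp_adjust) <= RED_ZONE_BYTES;
}
```

For the 480-byte array and the 360-byte adjustment, 120 bytes remain below rsp, which is within the 128-byte red zone. This reasoning applies only to leaf functions, since a call would push a return address into that area.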

x86-64 Big Integer Representation?

How do high-performance native big-integer libraries on x86-64 represent a big integer in memory? (Or does it vary? Is there a most common way?) Naively, I was thinking about storing them as 0-terminated strings of digits in base 2^64. For example, suppose X is in memory as: [8 bytes] D_n ... [8 bytes] D_2 [8 bytes] D_1 [8 bytes] D_0 [8 bytes] 0 Let B = 2^64. Then X = D_n * B^n + ... + D_2 * B^2 + D_1 * B^1 + D_0. The empty string (i.e. 8 bytes of zero) means zero. Is this a reasonable way? What are the pros and cons of this way? Is there a better way? How would you handle signedness? Does 2's
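To make the base-2^64 digit idea concrete, here is a minimal sketch of adding two numbers stored as arrays of 64-bit limbs, least significant limb first. It assumes an explicit limb count rather than the 0-terminator proposed in the question, since a terminator would clash with legitimate zero digits. The function name limbs_add is hypothetical and is not taken from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n limbs in base 2^64, least significant limb first.
 * Returns the carry out of the top limb (0 or 1). */
uint64_t limbs_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry; /* wrapped while adding the carry-in */
        uint64_t t = s + b[i];
        uint64_t c2 = t < s;     /* wrapped while adding b[i] */
        r[i] = t;
        carry = c1 | c2;         /* at most one of the two can be set */
    }
    return carry;
}
```

With a = {UINT64_MAX, 0} and b = {1, 0}, the low limb wraps to 0 and the carry moves into the second limb, giving {0, 1}.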

x86_64: is IMUL faster than 2x SHL + 2x ADD?

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode, I saw that this 'hand-optimized' piece of C code is translated back into a multiplication: int64_t calc(int64_t a) { return (a << 6) + (a << 16) - a; } Assembly: imul rdx,qword ptr [a],1003Fh So I was wondering whether that is really faster than doing it the way it is written, something like: mov rbx,qword ptr [a] mov rax,rbx shl rax,6 mov rcx,rbx shl rcx,10h add rax,rcx sub rax,rbx I was always under the impression that multiplication is slower than a few shifts/adds. Is that no longer the case with modern
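The constant in the imul follows from the arithmetic: (a << 6) + (a << 16) - a = a * (64 + 65536 - 1) = a * 65599, and 65599 is 0x1003F. The sketch below checks that both forms agree. The function names are hypothetical, and the work is done on uint64_t (an assumption made here to avoid undefined behaviour when shifting negative values), relying on the usual two's-complement conversion back to int64_t.

```c
#include <stdint.h>

/* Shift-and-add form from the question, computed on unsigned values so
 * that wraparound is well defined. */
int64_t calc_shifts(int64_t a)
{
    uint64_t u = (uint64_t)a;
    return (int64_t)((u << 6) + (u << 16) - u);
}

/* Single multiplication by 0x1003F (65599), as in the compiler output. */
int64_t calc_mul(int64_t a)
{
    return (int64_t)((uint64_t)a * UINT64_C(0x1003F));
}
```

Both functions return 65599 for an input of 1 and agree for negative inputs as well, which is why a compiler is free to pick whichever form it estimates to be cheaper.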

Can't call C standard library function on 64-bit Linux from assembly (yasm) code

I have a function foo written in assembly, assembled with yasm and linked with GCC on 64-bit Linux (Ubuntu). It simply prints a message to stdout using puts(). Here is how it looks: bits 64 extern puts global foo section .data message: db 'foo() called', 0 section .text foo: push rbp mov rbp, rsp lea rdi, [rel message] call puts pop rbp ret It is called by a C program compiled with GCC: extern void foo(); int main() { foo(); return 0; } Build commands: yasm -f elf64 foo_64_unix.asm gcc -c foo_main.c -o foo_main.o gcc foo_64_unix.o foo_main.o -o foo ./foo Here is the problem: When running the program it
