x86-64 | 易学教程

What is an efficient way to load an x64 ymm register with 4 separated doubles?

Question: What is the most efficient way to load an x64 ymm register with 4 doubles that are either (1) evenly spaced, e.g. given a contiguous set of doubles 0 1 2 3 4 5 6 7 8 9 10 .. 100, loading 0, 10, 20, 30, or (2) at arbitrary positions, e.g. loading 1, 6, 22, 43? zx485: The simplest approach is VGATHERQPD, which is an AVX2 instruction available on Haswell and up. VGATHERQPD ymm1, [rsi+ymm7*8], ymm2 Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1. which can achieve
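The gather approach can be sketched in C with the AVX2 intrinsic `_mm256_i64gather_pd`, which compilers map to VGATHERQPD. This is a minimal sketch rather than the answer's own code: the names `gather4`, `gather4_avx2` and `gather4_scalar` are hypothetical, the `target("avx2")` attribute and `__builtin_cpu_supports` assume GCC or Clang, and a plain scalar loop is used when AVX2 is not available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX2_GATHER_PATH 1

/* Hypothetical helper: gathers base[idx[0..3]] with one VGATHERQPD.
 * The target attribute lets this compile without -mavx2 (GCC/Clang). */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const int64_t idx[4], double out[4])
{
    /* Four 64-bit element indices in one ymm register. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double): the address is base + idx[i] * 8. */
    __m256d v = _mm256_i64gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}
#endif

/* Portable fallback: four independent scalar loads. */
static void gather4_scalar(const double *base, const int64_t idx[4], double out[4])
{
    for (size_t i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}

/* Hypothetical entry point: uses the gather instruction when the CPU has AVX2. */
void gather4(const double *base, const int64_t idx[4], double out[4])
{
#ifdef HAVE_AVX2_GATHER_PATH
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
        return;
    }
#endif
    gather4_scalar(base, idx, out);
}
```

Both cases from the question use the same call: indices `{0, 10, 20, 30}` give the evenly spaced load and `{1, 6, 22, 43}` the arbitrary one. Whether the gather is actually faster than four scalar loads plus inserts depends on the microarchitecture, so it is worth measuring.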

where is amd64 psABI? [closed]

Question (closed as off-topic 6 years ago): The AMD64 psABI used to be hosted at x86-64.org. I have a copy of the PDF file and it says explicitly: "The architecture specification is available on the web at http://www.x86-64.org/documentation." But http://www.x86-64.org has been down for a long time already, several months at least. Does anyone know where the latest

Printing Stack Frames

is it certain in which register arguments and variables are stored?

Question: I'm still uncertain how registers are used by the assembler. Say I have a program: int main(int rdi, int rsi, int rdx) { rdx = rdi; return 0; } Would this be translated into the following assembly: movq %rdx, %rdi ret rax; I'm new to AT&T syntax and have a hard time predicting when a certain register will be used. I am looking at a chart from Computer Systems: A Programmer's Perspective, third edition, R. E. Bryant and D. R. O'Hallaron. [chart not included in this excerpt] Answer: Is it certain in which register arguments and variables are stored? Only at entry and exit of a function. There is no guarantee as to what registers will be used
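To make "only at entry and exit" concrete: under the System V AMD64 calling convention (used on Linux), the first six integer or pointer arguments arrive in fixed registers at function entry, and an integer return value leaves in rax. The small lookup below is an illustrative sketch of that ordering only. The function name `sysv_int_arg_location` is hypothetical, and it says nothing about where the compiler keeps values inside the function body.

```c
#include <stddef.h>

/* Integer/pointer argument registers at function entry,
 * in System V AMD64 order. */
static const char *const sysv_int_arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Hypothetical helper: where the Nth (0-based) integer argument
 * lives at function entry. The seventh and later go on the stack. */
const char *sysv_int_arg_location(size_t position)
{
    size_t count = sizeof sysv_int_arg_regs / sizeof sysv_int_arg_regs[0];
    return position < count ? sysv_int_arg_regs[position] : "stack";
}
```

For the `main` in the question, this means the three int parameters arrive in the low halves of rdi, rsi and rdx (edi, esi, edx). After entry the compiler may copy, spill, or discard them as it sees fit, and with optimization enabled the dead assignment `rdx = rdi` would typically produce no instruction at all.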

Is mfence for rdtsc necessary on x86_64 platform?

Question: unsigned int lo = 0; unsigned int hi = 0; __asm__ __volatile__ ( "mfence;rdtsc" : "=a"(lo), "=d"(hi) : : "memory" ); Is the mfence in the above code necessary? Based on my test, no CPU reordering was observed. A fragment of the test code is included below. inline uint64_t clock_cycles() { unsigned int lo = 0; unsigned int hi = 0; __asm__ __volatile__ ( "rdtsc" : "=a"(lo), "=d"(hi) ); return ((uint64_t)hi << 32) | lo; } unsigned t1 = clock_cycles(); unsigned t2 = clock_cycles(); assert(t2 > t1); Answer 1:
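For comparison, here is a sketch of a serialized timestamp read using compiler intrinsics instead of inline assembly. It follows the note in Intel's manual that LFENCE placed immediately before RDTSC makes RDTSC wait for earlier instructions. This is not the missing answer from the excerpt: the name `clock_cycles_fenced` is hypothetical, the intrinsics assume GCC or Clang on x86-64, and the `clock()` branch is only a placeholder so the file builds elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>

/* Hypothetical helper: read the TSC with LFENCE on both sides. */
uint64_t clock_cycles_fenced(void)
{
    _mm_lfence();               /* earlier instructions complete before RDTSC */
    uint64_t tsc = __rdtsc();   /* full 64-bit counter value */
    _mm_lfence();               /* later instructions do not start before RDTSC */
    return tsc;
}
#else
#include <time.h>

/* Placeholder for non-x86-64 builds: not a cycle counter. */
uint64_t clock_cycles_fenced(void)
{
    return (uint64_t)clock();
}
#endif
```

Note that the test in the question stores the 64-bit result in `unsigned`, which drops the upper half. Keeping the value in `uint64_t` and comparing with `>=` avoids false failures when the low 32 bits wrap.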

If we marked memory as WC(Write Combined), then do we have any consistency automatically?

As we know, the x86 architecture provides acquire-release consistency automatically, i.e. all operations are ordered without any fences, except for a store followed by a load. (As Herb Sutter says on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c ) If we put an MFENCE (LFENCE+SFENCE) between them, then the store can't be reordered and the load can't be reordered, i.e. we get sequential consistency. But if we mark memory as WC (Write Combined), do we then have any consistency automatically without any fences, may

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

Question: Are Intel x86_64 processors not only a pipelined architecture, but also superscalar? Pipelining: these two sequences execute in parallel (different stages of the same pipeline unit in the same clock, for example ADD with 4 stages): stage1 -> stage2 -> stage3 -> stage4 -> nothing nothing -> stage1 -> stage2 -> stage3 -> stage4 Superscalar: these two sequences execute in parallel (two instructions can be launched to different pipeline units in the same clock, for example ADD and MUL): ADD(stage1) -> ADD(stage2) -> ADD(stage3) MUL(stage1) -> MUL(stage2) -> MUL(stage3) Answer: Yes, contemporary Intel

Why am I receiving SIGSEGV when invoking the sys_pause syscall?

Question: I am trying to create an x86_64 assembly program that displays "SIGTERM received" whenever the SIGTERM signal is sent. My application is using Linux syscalls directly: %define sys_write 0x01 %define sys_rt_sigaction 0x0d %define sys_pause 0x22 %define sys_exit 0x3c %define SIGTERM 0x0f %define STDOUT 0x01 ; Definition of sigaction struct for sys_rt_sigaction struc sigaction .sa_handler resq 1 .sa_flags resq 1 .sa_restorer resq 1 .sa_mask resq 1 endstruc section .data ; Message shown when a
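As a point of comparison for the raw-syscall version, the same behavior can be sketched in C through the libc `sigaction()` wrapper. This is a reference sketch, not a diagnosis of the truncated assembly: the names `on_sigterm`, `install_sigterm_handler` and `sigterm_seen` are hypothetical. One difference worth checking against the assembly is that the libc wrapper arranges the return-from-handler trampoline itself, whereas code calling `rt_sigaction` directly on x86-64 is generally expected to set `SA_RESTORER` and supply `sa_restorer`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the main program can observe delivery. */
static volatile sig_atomic_t got_sigterm = 0;

/* Handler: only async-signal-safe calls (write) are used here. */
static void on_sigterm(int signo)
{
    static const char msg[] = "SIGTERM received\n";
    (void)signo;
    got_sigterm = 1;
    ssize_t n = write(STDOUT_FILENO, msg, sizeof msg - 1);
    (void)n;
}

/* Hypothetical helper: install the SIGTERM handler; returns 0 on success. */
int install_sigterm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical accessor: 1 once SIGTERM has been handled. */
int sigterm_seen(void)
{
    return got_sigterm != 0;
}
```

A program would call `install_sigterm_handler()` once and then loop on `pause()`, which returns each time a handled signal has been delivered.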

JMP unexpected behavior in Shellcode when next(skipped) instruction is a variable definition

Purpose: I was trying to take advantage of RIP-relative addressing in x86-64. Even though the assembly performs as expected on its own, the shellcode does not. The Problem: Concisely, what I tried was this: jmp l1 str1: db "some string" l1: other code lea rax, [rel str1] I used the above at various places; it failed only at certain places and succeeded at others. I tried to play around and could not find any pattern in when it fails. When the variable (the str1: db directive) is positioned after the instruction accessing it, it never failed (in my observations). However, I want to remove nulls, hence I placed

Can I make shared library constructors execute before relocations?

Question: Background: I'm trying to implement a system like that described in this previous answer. In short, I have an application that links against a shared library (on Linux at present). I would like that shared library to switch between multiple implementations at runtime (for instance, based on whether the host CPU supports a certain instruction set). In its simplest case, I have three distinct shared library files: libtest.so : This is the "vanilla" version of the library that will be used as a
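The underlying idea of choosing an implementation by CPU capability can be sketched inside a single translation unit with a function pointer. This is a simplified illustration, not the multi-library scheme the question describes: all names (`sum_generic`, `sum_tuned`, `select_sum`, `sum_ints`) are hypothetical, the "tuned" variant is only a stand-in with the same logic, and `__builtin_cpu_supports` assumes GCC or Clang on x86-64.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *, size_t);

/* Baseline implementation that runs on any CPU. */
static long sum_generic(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Stand-in for a variant built for a newer instruction set.
 * It uses the same logic here so the sketch stays portable. */
static long sum_tuned(const int *v, size_t n)
{
    return sum_generic(v, n);
}

/* Pick an implementation based on what the host CPU reports. */
static sum_fn select_sum(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_tuned;
#endif
    return sum_generic;
}

/* Public entry point: resolves the implementation on first use.
 * Not thread-safe as written; a real library would resolve once at load. */
long sum_ints(const int *v, size_t n)
{
    static sum_fn impl = NULL;
    if (impl == NULL)
        impl = select_sum();
    return impl(v, n);
}
```

Resolving lazily on first call sidesteps the ordering problem in the title, since nothing depends on a constructor running before relocations are processed. The cost is an indirect call and a null check on every invocation.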