x86-64

What is an efficient way to load an x64 ymm register with 4 separated doubles?

﹥>﹥吖頭↗ submitted on 2019-12-06 17:26:53
What is the most efficient way to load an x64 ymm register with either of the following?

4 evenly spaced doubles: given a contiguous array of doubles 0 1 2 3 4 5 6 7 8 9 10 .. 100, I want to load, for example, 0, 10, 20, 30.

4 doubles at arbitrary positions: I want to load, for example, 1, 6, 22, 43.

zx485: The simplest approach is VGATHERDPD, an AVX2 instruction available on Haswell and later.

    VGATHERDPD ymm1, [rsi+xmm7*8], ymm2

    Using dword indices specified in vm32x, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

which can achieve
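As a rough illustration of the gather approach described above, here is a minimal C sketch. It assumes GCC or Clang on x86-64; `gather4` and `gather4_avx2` are hypothetical names, not taken from the answer. The AVX2 path uses the `_mm256_i32gather_pd` intrinsic, which takes four 32-bit indices and a byte scale, and a scalar loop is used when the CPU does not report AVX2.

```c
#include <stddef.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_GATHER4_AVX2 1

/* AVX2 path: a single gather through the _mm256_i32gather_pd intrinsic. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const int idx[4], double out[4])
{
    /* Load the four 32-bit (dword) indices into an xmm register. */
    __m128i vindex = _mm_loadu_si128((const __m128i *)idx);
    /* Scale 8 = sizeof(double): element i is read from base + idx[i] * 8. */
    __m256d v = _mm256_i32gather_pd(base, vindex, 8);
    _mm256_storeu_pd(out, v);
}
#endif

/* Hypothetical helper: out[i] = base[idx[i]] for i = 0..3. */
void gather4(const double *base, const int idx[4], double out[4])
{
#ifdef HAVE_GATHER4_AVX2
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
        return;
    }
#endif
    /* Scalar fallback for CPUs or compilers without the AVX2 path. */
    for (size_t i = 0; i < 4; i++)
        out[i] = base[idx[i]];
}
```

The same function covers both cases from the question: indices {0, 10, 20, 30} for evenly spaced elements and {1, 6, 22, 43} for arbitrary positions. Whether a gather is actually faster than four scalar loads depends on the microarchitecture and is worth measuring.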

Where is the AMD64 psABI? [closed]

不打扰是莪最后的温柔 submitted on 2019-12-06 17:11:16
Question (closed 6 years ago as off-topic; not currently accepting answers): The AMD64 psABI used to be hosted at x86-64.org. I have a copy of the PDF file, and it says explicitly: "The architecture specification is available on the web at http://www.x86-64.org/documentation." But http://www.x86-64.org has been down for a long time already, several months at least. Does anyone know where the latest

Printing Stack Frames

 ̄綄美尐妖づ submitted on 2019-12-06 15:46:38
So I am currently learning about stack frames, and I wanted to experiment with printing the stack frame of a function manually. I have the following picture of a stack frame in mind (I may be wrong):

               |                                |
    0xffff0fdc +--------------------------------+
               |              ...               |
    0xffff0fd8 +--------------------------------+
               |          parameter 2           |
    0xffff0fd4 +--------------------------------+
               |          parameter 1           |
    0xffff0fd0 +--------------------------------+
               |         return address         |
    0xffff0fcc +--------------------------------+
               |        local variable 2        |
    0xffff0fc8 +--------------------------------+
               |        local variable 1        |
    0xffff0fc4 +----------
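One way to inspect a frame like the one pictured, sketched here under assumptions that are not in the question: GCC or Clang, and a target such as x86-64 where the word at the frame pointer is the caller's saved frame pointer and the next word up is the return address. `dump_frame` is a hypothetical helper name; `__builtin_frame_address` and `__builtin_return_address` are compiler builtins.

```c
#include <stdio.h>

/* Hypothetical helper: prints the two words at the frame pointer and
 * returns 1 if the second one equals the compiler-reported return address. */
__attribute__((noinline))
int dump_frame(void)
{
    /* Level 0 means "the function currently executing". */
    void **fp  = (void **)__builtin_frame_address(0);
    void  *ret = __builtin_return_address(0);

    printf("frame pointer   : %p\n", (void *)fp);
    printf("saved caller fp : %p (stored at %p)\n", fp[0], (void *)&fp[0]);
    printf("return address  : %p (stored at %p)\n", fp[1], (void *)&fp[1]);

    /* Cross-check the manual read against the builtin. */
    return fp[1] == ret;
}
```

Note that the picture above shows parameters stored next to the return address, which matches 32-bit stack-based calling conventions. Under the x86-64 System V convention the first six integer arguments are passed in registers, so they will not necessarily appear at those positions.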

Is it certain in which register arguments and variables are stored?

时光怂恿深爱的人放手 submitted on 2019-12-06 13:27:15
I'm still uncertain how registers are used by the assembler. Say I have a program:

    int main(int rdi, int rsi, int rdx) {
        rdx = rdi;
        return 0;
    }

Would this be translated into the following assembly?

    movq %rdx, %rdi
    ret rax;

I'm new to AT&T syntax and have a hard time predicting when a certain register will be used. Looking at this chart from Computer Systems: A Programmer's Perspective, third edition, R.E. Bryant and D. R. O'Hallaron:

[chart]

"Is it certain in which register arguments and variables are stored?" Only at entry and exit of a function. There is no guarantee as to what registers will be used

Is mfence for rdtsc necessary on x86_64 platform?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 11:56:20
Question:

    unsigned int lo = 0;
    unsigned int hi = 0;
    __asm__ __volatile__ (
        "mfence;rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "memory"
    );

Is the mfence in the above code necessary? In my tests, no CPU reordering was observed. A fragment of the test code is included below.

    inline uint64_t clock_cycles() {
        unsigned int lo = 0;
        unsigned int hi = 0;
        __asm__ __volatile__ (
            "rdtsc"
            : "=a"(lo), "=d"(hi)
        );
        return ((uint64_t)hi << 32) | lo;
    }

    /* Keep all 64 bits; a plain unsigned would truncate the counter. */
    uint64_t t1 = clock_cycles();
    uint64_t t2 = clock_cycles();
    assert(t2 > t1);

Answer 1:
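Independently of the answer text, which is cut off here, Intel's instruction reference for RDTSC describes placing LFENCE immediately before it when the counter must be read only after earlier instructions have executed. A minimal sketch of that pattern follows, assuming GCC or Clang on x86 and using the `__rdtsc` and `_mm_lfence` intrinsics; `clock_cycles_fenced` is a hypothetical name. Behaviour on non-Intel CPUs may differ and is not covered here.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC/Clang, x86 only) */

/* Hypothetical helper: read the TSC after earlier instructions have executed. */
uint64_t clock_cycles_fenced(void)
{
    _mm_lfence();        /* LFENCE immediately before RDTSC */
    return __rdtsc();    /* full 64-bit counter, no manual hi/lo merge */
}
```

Two back-to-back calls on the same core are expected to return non-decreasing values, provided both results are kept in 64-bit variables.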

If we mark memory as WC (Write Combined), do we get any consistency automatically?

三世轮回 submitted on 2019-12-06 11:29:32
As we know, the x86 architecture provides acquire-release consistency automatically, i.e. all operations are ordered without any fences, except for a store followed by a load (as Herb Sutter says on page 34: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c ). If we put MFENCE (LFENCE+SFENCE) between them, then the store can't be reordered and the load can't be reordered, i.e. we get sequential consistency. But if we mark memory as WC (Write Combined), do we get any consistency automatically without any fences, may

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

为君一笑 submitted on 2019-12-06 11:04:10
Are Intel x86_64 processors not only pipelined, but also superscalar?

Pipelining: these two sequences execute in parallel (different stages of the same pipeline unit in the same clock, for example ADD with 4 stages):

    stage1  -> stage2 -> stage3 -> stage4 -> nothing
    nothing -> stage1 -> stage2 -> stage3 -> stage4

Superscalar: these two sequences execute in parallel (two instructions can be launched to different pipeline units in the same clock, for example ADD and MUL):

    ADD(stage1) -> ADD(stage2) -> ADD(stage3)
    MUL(stage1) -> MUL(stage2) -> MUL(stage3)

Yes, contemporary Intel
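Seen from software, the superscalar case is what lets independent operations overlap. The following C sketch, with hypothetical function names, contrasts a sum that forms a single dependency chain with one split into two independent chains. On a superscalar core the two additions in each iteration of the second loop do not depend on each other, so they are candidates for issue in the same clock. No timing is claimed here, and an optimizing compiler may transform either loop.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the previous one. */
uint64_t sum_one_chain(const uint32_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two accumulators: the two additions per iteration are independent. */
uint64_t sum_two_chains(const uint32_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)           /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;
}
```

Both functions return the same value for any input, because integer addition is associative; only the shape of the dependency graph differs.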

Why am I receiving SIGSEGV when invoking the sys_pause syscall?

爱⌒轻易说出口 submitted on 2019-12-06 09:25:35
Question: I am trying to create an x86_64 assembly program that displays "SIGTERM received" whenever the SIGTERM signal is sent. My application uses Linux syscalls directly:

    %define sys_write         0x01
    %define sys_rt_sigaction  0x0d
    %define sys_pause         0x22
    %define sys_exit          0x3c

    %define SIGTERM           0x0f

    %define STDOUT            0x01

    ; Definition of sigaction struct for sys_rt_sigaction
    struc sigaction
        .sa_handler   resq 1
        .sa_flags     resq 1
        .sa_restorer  resq 1
        .sa_mask      resq 1
    endstruc

    section .data
        ; Message shown when a

Unexpected JMP behavior in shellcode when the next (skipped) instruction is a variable definition

只愿长相守 submitted on 2019-12-06 09:12:41
Purpose: I was trying to take advantage of RIP-relative addressing in x86-64. Even though the assembly performs as expected on its own, the shellcode does not.

The problem: Concisely, what I tried was this:

    jmp l1
    str1: db "some string"
    l1:
        other code
        lea rax, [rel str1]

I used the above in various places; it failed only in certain places and succeeded in others. I tried to play around and could not find any pattern in when it fails. When the variable (the str1: db instruction) is positioned after the instruction accessing it, it never failed (in my observations). However, I want to remove nulls, hence I placed

Can I make shared library constructors execute before relocations?

倖福魔咒の submitted on 2019-12-06 09:02:17
Question: Background: I'm trying to implement a system like the one described in this previous answer. In short, I have an application that links against a shared library (on Linux at present). I would like that shared library to switch between multiple implementations at runtime (for instance, based on whether the host CPU supports a certain instruction set). In its simplest case, I have three distinct shared library files: libtest.so: This is the "vanilla" version of the library that will be used as a