x86-64 | 易学教程

Why are global variables in x86-64 accessed relative to the instruction pointer?

I compiled C code to assembly using gcc -S -fasm foo.c . The C code declares a global variable and a local variable in main, as shown below: int y=6; int main() { int x=4; x=x+y; return 0; } Looking at the generated assembly, I saw that the global variable y is accessed using the value of the rip instruction pointer. I thought that only const global variables were stored in the text segment, but this example suggests that regular global variables are stored in the text segment too, which is very weird. I guess that some
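
As a hedged illustration of what an operand such as y(%rip) means: the instruction encodes a signed 32-bit displacement, and the CPU adds it to the address of the next instruction. The variable itself can still live in a data section; only the way its address is formed is relative to the code. The helper name rip_relative_target and the sample addresses in the usage note are made up for this sketch.

```c
#include <stdint.h>

/* Effective address of a RIP-relative operand: the address of the *next*
 * instruction plus the sign-extended 32-bit displacement from the encoding.
 * Hypothetical helper for illustration, not part of any toolchain. */
uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp32)
{
    return next_insn_addr + (uint64_t)(int64_t)disp32;
}
```

For example, with the next instruction at 0x401000 and a displacement of 0x2f00, the operand refers to address 0x403f00, which may well lie outside the text segment.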

What is the purpose of the 40h REX opcode in ASM x64?

I've been trying to understand the purpose of the 0x40 REX opcode in x64 assembly instructions. For instance, in this function prologue from Kernel32.dll, push rbx is encoded as: 40 53 push rbx But using just the 53h opcode (without the prefix) produces the same result. According to this site , the REX prefix has a fixed layout, and by that layout the 40h opcode seems to do nothing. Can someone explain its purpose? Answer (Nathan Fellman): The 04xh bytes (i.e. 040h , 041h ... 04fh ) are indeed REX bytes. Each bit in the lower nibble has a meaning, as you listed in your question. The
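
A minimal sketch of the REX layout discussed here, assuming 64-bit mode: the high nibble is fixed at 0100 and the low nibble holds the W, R, X and B bits. The struct and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four flag bits carried in the low nibble of a REX prefix (0100WRXB). */
struct rex_bits {
    bool w; /* 64-bit operand size */
    bool r; /* extends the ModRM.reg field */
    bool x; /* extends the SIB.index field */
    bool b; /* extends ModRM.rm, SIB.base, or the opcode register field */
};

/* Returns true and fills *out if byte is in 0x40..0x4F, false otherwise.
 * Hypothetical helper; a real decoder must also track instruction context. */
bool decode_rex(uint8_t byte, struct rex_bits *out)
{
    if ((byte & 0xF0) != 0x40)
        return false;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b = byte & 1;
    return true;
}
```

Decoding 0x40 yields all four bits clear, which matches the observation that 40 53 and a bare 53 both encode push rbx.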

FillChar and StringOfChar under Delphi 10.2 for Win64 Release Target

I have a question about a specific programming problem in Delphi 10.2 (Pascal). StringOfChar and FillChar don't work properly in a Win64 Release build on CPUs released before 2012. The expected result of FillChar is a plain sequence of one repeated 8-bit character in a given memory buffer. The expected result of StringOfChar is the same, except that the result is stored in a string type. But, in fact, when I compile our applications that worked in Delphi prior to 10.2

Address Space Layout Randomization (ASLR) and mmap

I expected that, due to Address Space Layout Randomization (ASLR), a process forked from another process would get different addresses back when calling mmap . But as I found out, that was not the case. I made the following test program for that purpose. All the addresses returned by malloc are exactly the same for the parent and the child. Note that the malloc for cl1 , cl2 , pl1 , pl2 internally uses mmap because they are large blocks. So, my question is, why is mmap not returning different addresses even in the presence of ASLR? Maybe it's because the seed for randomization here is the same
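
A small sketch of the experiment described here, assuming a POSIX system. After fork the child starts with a copy of the parent's address space layout, so an identical mmap request in each process can plausibly land at the same address; layout randomization is normally chosen when a new program image is loaded, not at fork. The function name mmap_matches_after_fork is invented for this example, and the result is not guaranteed on every kernel.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if parent and child get the same address
 * from their first post-fork mmap, 0 if they differ, -1 on error. */
int mmap_matches_after_fork(size_t len)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Both processes now map a fresh anonymous region of the same size. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uintptr_t mine = (uintptr_t)p;

    if (pid == 0) {
        /* Child: report its address to the parent and exit immediately. */
        close(fds[0]);
        ssize_t n = write(fds[1], &mine, sizeof mine);
        _exit(n == (ssize_t)sizeof mine ? 0 : 1);
    }

    /* Parent: read the child's address and compare it with its own. */
    close(fds[1]);
    uintptr_t theirs = 0;
    ssize_t n = read(fds[0], &theirs, sizeof theirs);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    int ok = p != MAP_FAILED && n == (ssize_t)sizeof theirs &&
             theirs != (uintptr_t)MAP_FAILED;
    if (p != MAP_FAILED)
        munmap(p, len);
    return ok ? (mine == theirs) : -1;
}
```

Calling mmap_matches_after_fork(1 << 20) reproduces the comparison from the question for a single 1 MiB mapping.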

Storing individual doubles from a packed double vector using Intel AVX

I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d ), what would be the most efficient way (i.e. the least number of operations) to store each of them to a different place in memory (i.e. I need to fan them out to different locations such that they are no longer packed)? Pseudocode: __m256d *src; double *dst; int dst_dist; dst[0] = src[0]; dst[dst_dist] = src[1]; dst[2 * dst_dist] = src[2]; dst[3 * dst_dist] = src[3]; Using SSE, I could do this with __m128 types using the _mm_storel_pi and _mm_storeh_pi intrinsics. I've not been

Intriguing assembly for comparing std::optional of primitive types

Valgrind picked up a flurry of "Conditional jump or move depends on uninitialised value(s)" errors in one of my unit tests. Inspecting the assembly, I realized that the following code: bool operator==(MyType const& left, MyType const& right) { // ... some code ... if (left.getA() != right.getA()) { return false; } // ... some code ... return true; } Where MyType::getA() const -> std::optional<std::uint8_t> , generated the following assembly: 0x00000000004d9588 <+108>: xor eax,eax 0x00000000004d958a <+110

Why does GCC call libc's sqrt() without using its result?

Using GCC 6.3, the following C++ code: #include <cmath> #include <iostream> void norm(double r, double i) { double n = std::sqrt(r * r + i * i); std::cout << "norm = " << n; } generates the following x86-64 assembly: norm(double, double): mulsd %xmm1, %xmm1 subq $24, %rsp mulsd %xmm0, %xmm0 addsd %xmm1, %xmm0 pxor %xmm1, %xmm1 ucomisd %xmm0, %xmm1 sqrtsd %xmm0, %xmm2 movsd %xmm2, 8(%rsp) jbe .L2 call sqrt .L2: movl std::cout, %edi movl $7, %edx movl $.LC1, %esi call std::basic_ostream<char,

Are word-aligned loads faster than unaligned loads on x64 processors?

Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64 bit) processors? A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example: struct A { char a; uint64_t b; }; The struct A usually has a size of 16 bytes. On the other hand, the documentation of the Snappy compressor states that Snappy assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code this is true of
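
The two observations in this question can be sketched in C. The padding in struct A follows from the ABI's alignment rule for uint64_t and is not in itself a timing measurement, while an unaligned load can be written portably with memcpy, which compilers for x86-64 typically turn into a single move. The helper names offset_of_b and load_u64_unaligned are invented for this sketch, and it measures nothing about speed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct A {
    char a;
    uint64_t b; /* the compiler inserts padding after a so that b is aligned */
};

/* Offset of b inside struct A; 8 on the usual x86-64 ABIs. */
size_t offset_of_b(void)
{
    return offsetof(struct A, b);
}

/* Reads a 64-bit value from an address with no alignment requirement.
 * memcpy avoids the undefined behavior of casting p to uint64_t *. */
uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

A benchmark comparing load_u64_unaligned at aligned and misaligned offsets would be the way to test the colleague's claim on a given CPU.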

GCC code that seems to break inline assembly rules but an expert believes otherwise

I was in a discussion with an expert who allegedly has vastly superior coding skills to mine and who understands inline assembly far better than I ever could. One of the claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or specify that the register has been potentially modified by the inline assembly. The conversation came about when someone else was trying to get assistance on a memset implementation that was effectively coded this way: void *memset(void *dest, int value, size_t count) { asm volatile ("cld; rep stosb" :: "D"(dest), "c"(count),
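
For comparison, here is a hedged sketch of how the same routine could be written so that the constraints describe what rep stosb actually does. GCC's documentation says input-only operands must not be modified, so the registers that the instruction changes are declared as read-write operands and the written buffer is covered by a "memory" clobber. The name memset_stosb and the portable fallback branch are additions for this example, not taken from the discussion.

```c
#include <stddef.h>

/* Hypothetical variant of the memset from the discussion, with constraints
 * that tell the compiler which registers and memory the asm modifies. */
void *memset_stosb(void *dest, int value, size_t count)
{
#if defined(__GNUC__) && defined(__x86_64__)
    void *d = dest;   /* copy, because RDI is advanced by the instruction */
    size_t n = count; /* copy, because RCX is counted down to zero */
    __asm__ volatile ("cld; rep stosb"
                      : "+D"(d), "+c"(n) /* read and written by the asm */
                      : "a"(value)       /* AL is only read */
                      : "memory");       /* the destination buffer is written */
#else
    /* Portable fallback so the sketch also builds on other targets. */
    unsigned char *p = dest;
    while (count--)
        *p++ = (unsigned char)value;
#endif
    return dest;
}
```

Because d and n are separate read-write operands, the compiler knows that dest and count still hold their original values after the asm statement, so returning dest is safe.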

gcc argument register spilling on x86-64

I'm doing some experimenting with x86-64 assembly. Having compiled this dummy function: long myfunc(long a, long b, long c, long d, long e, long f, long g, long h) { long xx = a * b * c * d * e * f * g * h; long yy = a + b + c + d + e + f + g + h; long zz = utilfunc(xx, yy, xx % yy); return zz + 20; } With gcc -O0 -g I was surprised to find the following in the beginning of the function's assembly: 0000000000400520 <myfunc>: 400520: 55 push rbp 400521: 48 89 e5 mov rbp,rsp 400524: 48 83 ec 50 sub rsp,0x50 400528: 48 89 7d d8 mov QWORD PTR [rbp-0x28],rdi 40052c: 48 89 75 d0 mov QWORD PTR [rbp