x86-64

Differences in the initialization of the EAX register when calling a function in C and C++

Submitted by 我的梦境 on 2019-12-01 15:38:49
There is a curious difference between the assembly of a small program when it is compiled as a C program versus a C++ program (for Linux x86-64). The code in question: int fun(); int main(){ return fun(); } Compiling it as a C program (with gcc -O2) yields: main: xorl %eax, %eax jmp fun But compiling it as a C++ program (with g++ -O2) yields: main: jmp _Z3funv I find it puzzling that the C version initializes the return value of the main function with 0 (xorl %eax, %eax). Which feature of the C language is responsible for this necessity? Edit: It is true that, for int fun(void); there is no…
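The usual explanation for that stray xorl: in C, "int fun();" is an unprototyped (old-style) declaration, and the x86-64 SysV ABI requires callers of unprototyped or variadic functions to pass the number of XMM registers used for arguments in AL. Zeroing EAX zeroes AL; it is not initializing main's return value. A minimal sketch (the function bodies are hypothetical, added only so the example links):

```c
/* In C, "int fun();" says nothing about the parameters, so a call
 * through that declaration must follow the variadic convention:
 * AL = number of XMM registers used for arguments.
 * In C++, "int fun()" means exactly "int fun(void)". */
int fun();                        /* unprototyped: caller must set AL */
int fun_proto(void);              /* prototyped: no AL setup needed   */

int fun() { return 7; }           /* hypothetical body for the sketch */
int fun_proto(void) { return 7; } /* hypothetical body for the sketch */

/* At -O2, call_unproto typically compiles to "xorl %eax,%eax; jmp fun",
 * while call_proto can be a bare "jmp fun_proto". */
int call_unproto(void) { return fun(); }
int call_proto(void)   { return fun_proto(); }
```

Adding the prototype int fun(void); in the C version removes the xor as well, which matches the "Edit" in the question.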

assembly cltq and movslq difference

Submitted by 十年热恋 on 2019-12-01 15:31:10
Question: Chapter 3 of Computer Systems: A Programmer's Perspective (2nd Edition) mentions that cltq is equivalent to movslq %eax, %rax. Why did they create a new instruction (cltq) instead of just using movslq %eax, %rax? Isn't that redundant? Answer 1: TL;DR: use cltq when possible, because it's one byte shorter than the exactly-equivalent movslq %eax, %rax. That's a very minor advantage (so don't sacrifice anything else to make this happen), but choose eax if you're going to want to sign-extend it a…
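As a sketch of what both encodings compute: cltq (Intel syntax cdqe) and movslq %eax, %rax both sign-extend the low 32 bits of RAX to 64 bits; cltq is one byte shorter because it needs no ModRM byte. In C the operation is just a widening cast, which compilers lower to one of these instructions:

```c
#include <stdint.h>

/* Sign-extend a 32-bit value to 64 bits. When the source and
 * destination happen to be eax/rax, gcc and clang typically emit
 * cltq here; otherwise they use movslq with the right registers. */
int64_t sign_extend32(int32_t x) {
    return (int64_t)x;
}
```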

Using SIMD on amd64, when is it better to use more instructions vs. loading from memory?

Submitted by ≡放荡痞女 on 2019-12-01 15:21:24
I have some highly perf-sensitive code. A SIMD implementation using SSEn and AVX uses about 30 instructions, while a version that uses a 4096-byte lookup table uses about 8 instructions. In a microbenchmark, the lookup table is faster by 40%. If I microbenchmark while trying to invalidate the cache every 100 iterations, they appear about the same. In my real program, it appears that the non-loading version is faster, but it's really hard to get a provably good measurement, and I've had measurements go both ways. I'm just wondering if there are some good ways to think about which one would be better…
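One reason the microbenchmark flatters the table: a 4096-byte table fits entirely in L1, so a tight loop keeps it permanently hot, while a real program keeps evicting it, and each miss costs far more than a few extra ALU instructions. A hedged sketch of the cold-cache setup (compute_lut and compute_alu are hypothetical stand-ins for the two versions):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two implementations under test. */
static uint8_t table[4096];                 /* the 4 KiB lookup table */
uint32_t compute_lut(uint32_t x) { return table[x & 4095]; }
uint32_t compute_alu(uint32_t x) { return (x * 2654435761u) >> 24; }

/* Approximate a cache-cold caller: between timed iterations, stream
 * through a buffer larger than L2 so the table is evicted. */
static uint8_t junk[8u << 20];
void thrash_cache(void) {
    for (size_t i = 0; i < sizeof junk; i += 64)
        junk[i]++;                          /* touch one line per 64 B */
}
```

Timing compute_lut with thrash_cache() interleaved approximates the eviction pressure of the real program; the ALU version's cost barely changes under the same treatment, which is the effect the question describes.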

Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

Submitted by 狂风中的少年 on 2019-12-01 15:16:23
In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following:

Architecture   Latency  Throughput
Haswell        3        -
Ivy Bridge     1        -
Sandy Bridge   1        -

I found this difference puzzling. Is this difference because there are new instructions that replace these ones, or something that compensates for it (which ones)? Does anyone know if Skylake changes this model further? Answer (Peter Cordes): TL;DR: all lane-crossing shuffles / inserts / extracts have 3c latency…
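Part of the standard answer is that a 3-cycle latency only costs you when the result feeds a dependency chain; throughput is unchanged, so independent work hides the extra cycles. A generic illustration of that principle (plain scalar C, not the AVX intrinsics themselves):

```c
#include <stddef.h>
#include <stdint.h>

/* One long dependency chain: each addition waits for the previous
 * result, so per-element cost is bounded by the add *latency*. */
uint64_t sum_chained(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent chains: the additions overlap in the pipeline, so
 * per-element cost approaches the *throughput* limit instead. */
uint64_t sum_two_chains(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
    for (; i < n; i++) s0 += a[i];
    return s0 + s1;
}
```

The same reasoning applies to a 3c-latency vinsertf128: if consecutive inserts are independent, the extra latency on Haswell is largely hidden.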

Problem switching to v8086 mode from 32-bit protected mode by setting EFLAGS.VM to 1

Submitted by 我只是一个虾纸丫 on 2019-12-01 15:00:57
Question: I'm in 32-bit protected mode running at current privilege level CPL=0. I'm trying to enter v8086 mode by setting the EFLAGS.VM flag (bit 17) to 1 (and IOPL to 0) and doing a FAR JMP to my 16-bit real-mode code. I get the current flags using PUSHF; set EFLAGS.VM (bit 17) to 1; set EFLAGS.IOPL (bits 12 and 13) to 0; then install the new EFLAGS with POPF. The code for this looks like: bits 32 cli [snip] pushf ; Get current EFLAGS pop eax or eax, 1<<EFLAGS_VM_BIT ; Set VM flag to enter v8086 mode and eax…
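The flag arithmetic itself can be sketched in C (bit positions from the Intel SDM: VM is bit 17, IOPL is bits 12-13). Note one classic pitfall with this approach: POPF silently ignores the VM bit, so the usual way to enter v8086 mode is an IRET whose stack image has VM set in the saved EFLAGS.

```c
#include <stdint.h>

#define EFLAGS_VM   (1u << 17)   /* virtual-8086 mode flag          */
#define EFLAGS_IOPL (3u << 12)   /* I/O privilege level, bits 12-13 */

/* Build the EFLAGS image the asm constructs: VM set, IOPL = 0. */
uint32_t make_v8086_eflags(uint32_t eflags) {
    eflags |= EFLAGS_VM;
    eflags &= ~EFLAGS_IOPL;
    return eflags;
}
```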

How much does function alignment actually matter on modern processors?

Submitted by 二次信任 on 2019-12-01 15:00:45
Question: When I compile C code with a recent compiler on an amd64 or x86 system, functions are aligned to a multiple of 16 bytes. How much does this alignment actually matter on modern processors? Is there a huge performance penalty associated with calling an unaligned function? Benchmark: I ran the following microbenchmark (call.S): // benchmarking performance penalty of function alignment. #include <sys/syscall.h> #ifndef SKIP # error "SKIP undefined" #endif #define COUNT 1073741824 .globl _start…
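For experimenting from C rather than raw assembly, gcc's -falign-functions=N flag controls the alignment the benchmark probes, and the resulting entry address is easy to inspect at run time. A small sketch (the function is hypothetical, and the observed alignment is compiler- and flag-dependent, so the helper reports rather than assumes):

```c
#include <stdint.h>
#include <stdio.h>

int target(void) { return 42; }   /* hypothetical function under test */

/* Nonzero if the function's entry point sits on a 16-byte boundary. */
int is_aligned16(int (*fn)(void)) {
    return ((uintptr_t)fn & 15u) == 0;
}

void report(void) {
    printf("target at %p, 16-byte aligned: %d\n",
           (void *)target, is_aligned16(target));
}
```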

What does this Intel jmpq instruction do?

Submitted by 会有一股神秘感。 on 2019-12-01 14:49:18
Question: How is the address 0x600860 computed in the Intel instruction below? 0x4003b8 + 0x2004a2 = 0x60085a, so I don't see how the computation is carried out. 0x4003b8 <puts@plt>: jmpq *0x2004a2(%rip) # 0x600860 <puts@got.plt> Answer 1: On Intel, JMP, CALL, etc. are relative to the program counter of the next instruction. The next instruction in your case was at 0x4003be, and 0x4003be + 0x2004a2 == 0x600860. Answer 2: It's AT&T syntax for a memory-indirect JMP with a RIP-relative addressing mode. The jump…
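The arithmetic in the answer generalizes: a RIP-relative displacement is added to the address of the *next* instruction, i.e. instruction address plus instruction length (this jmpq encoding is 6 bytes: ff 25 plus a 4-byte displacement). As a one-line sketch:

```c
#include <stdint.h>

/* Effective address of a RIP-relative operand: the displacement is
 * relative to the instruction following the one that uses it. */
uint64_t rip_relative_target(uint64_t insn_addr, uint64_t insn_len,
                             int64_t disp) {
    return insn_addr + insn_len + (uint64_t)disp;
}
```

For the question's instruction: rip_relative_target(0x4003b8, 6, 0x2004a2) gives 0x600860, matching the disassembler's comment.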

What is _GLOBAL_OFFSET_TABLE?

Submitted by 江枫思渺然 on 2019-12-01 14:41:08
Question: Using the nm command on Linux to see the symbols in my program, I see a symbol by the name _GLOBAL_OFFSET_TABLE_, as shown below. Can somebody elaborate on what _GLOBAL_OFFSET_TABLE_ is used for? 0000000000614018 d _GLOBAL_OFFSET_TABLE_ Answer 1: _GLOBAL_OFFSET_TABLE_ is used to locate the real addresses of globals (functions, variables, etc.) for PIC (position-independent code). It's commonly referred to as the GOT; you can read up on it here and a more in-depth one here. Source: https://stackoverflow.com…

movq and 64 bit numbers

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-01 14:26:29
When I write to a register, everything is fine: movq $0xffffffffffffffff, %rax But I get "Error: operand size mismatch" when I write to a memory location: movq $0xffffffffffffffff, -8(%rbp) Why is that? I see in compiled C code that in asm these numbers are split in two and two movl instructions show up. Maybe you can tell me where movq and other instructions are documented. Answer: Why is that? Because MOV r64, imm64 is a valid x86 instruction, but MOV r/m64, imm64 is not (there's no encoding for it). I see in compiled C code that in asm these numbers are split in two and two movl instructions show…
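The rule behind the error: MOV into a 64-bit register can take a full imm64 (the movabs form), but MOV into memory only takes an imm32 that the CPU sign-extends, so only constants in [-2^31, 2^31-1] are encodable. The bit pattern 0xffffffffffffffff is -1 and would fit if written as $-1; GAS just refuses to reinterpret the 64-bit literal for you. A small predicate capturing the encodable range:

```c
#include <stdint.h>

/* Can this 64-bit value be encoded as the sign-extended imm32 of
 * MOV r/m64, imm32 (i.e. a "movq $imm, mem" instruction)? */
int fits_sign_extended_imm32(int64_t imm) {
    return imm >= INT32_MIN && imm <= INT32_MAX;
}
```

When the predicate is false, compilers either load the constant into a register first (movabs) or, as the question observed, split the store into two movl instructions.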