x86-64

Detecting architecture at compile time from MASM/MASM64

不想你离开。 Submitted on 2019-12-05 06:14:17
How can I detect at compile time, from an ASM source file, whether the target architecture is I386 or AMD64? I am using MASM (ml.exe) and MASM64 (ml64.exe) to assemble file32.asm and file64.asm. It would be nice to have a single file, file.asm, which includes either file32.asm or file64.asm depending on the architecture. Ideally, I would like to be able to write something like:

    IFDEF amd64
        include file64.asm
    ELSE
        include file32.asm
    ENDIF

Also, if needed, I can run ml.exe and ml64.exe with different command-line options. Thanks!

If I understand you correctly, you're looking for some sort of built

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

主宰稳场 Submitted on 2019-12-05 06:07:14
Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK. So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

    // Global constant
    const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

    // ...
    // function code
    __m256i x = /* computed here */;
    const __m128i high32 =
        _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x, gHigh32Permute)); // This seems to take 3 cycles

On Intel, your code would be optimal. One 1-uop instruction is the
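
A minimal sketch of the shuffle-based alternative, assuming the float (__m256/__m128) types from the question title and an invented helper name odd_elements: vextractf128 and shufps are both 128-bit operations, which avoids the lane-crossing 256-bit permute that tends to be more expensive on Zen 1. For the __m256i form used in the code above, the same idea applies with casts around it.

    #include <immintrin.h>

    /* Hedged sketch (not the thread's accepted answer): pack the odd-index
     * 32-bit elements {1,3,5,7} of a __m256 into a __m128 using only
     * 128-bit shuffles. The helper name is invented for this example. */
    static inline __m128 odd_elements(__m256 v)
    {
        __m128 lo = _mm256_castps256_ps128(v);    /* elements 0..3 (no instruction) */
        __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 (vextractf128)   */
        /* shufps: take elements 1,3 from lo and 1,3 from hi -> {1,3,5,7} */
        return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
    }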

xorl %eax, %eax in x86_64 assembly code produced by gcc

十年热恋 Submitted on 2019-12-05 04:18:08
I'm a total noob at assembly, just poking around a bit to see what's going on. Anyway, I wrote a very simple function:

    void multA(double *x, long size)
    {
        long i;
        for (i = 0; i < size; ++i) {
            x[i] = 2.4 * x[i];
        }
    }

I compiled it with:

    gcc -S -m64 -O2 fun.c

And I get this:

        .file   "fun.c"
        .text
        .p2align 4,,15
        .globl  multA
        .type   multA, @function
    multA:
    .LFB34:
        .cfi_startproc
        testq   %rsi, %rsi
        jle     .L1
        movsd   .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
    .L3:
        movsd   (%rdi,%rax,8), %xmm0
        mulsd   %xmm1, %xmm0
        movsd   %xmm0, (%rdi,%rax,8)
        addq    $1, %rax
        cmpq    %rsi, %rax
        jne     .L3
    .L1:
        rep ret
        .cfi_endproc
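
As an aside on the instruction the title asks about: in 64-bit mode, writing a 32-bit register zero-extends into the full 64-bit register, so xorl %eax, %eax is the standard zeroing idiom; it clears %rax while being one byte shorter than the xorq form (no REX.W prefix). A small GNU C sketch of that equivalence, with invented function names:

    /* Illustrative sketch (GNU C, AT&T syntax): both functions leave the full
     * 64-bit register equal to zero, because a write to a 32-bit register
     * zero-extends into the upper half. */
    static unsigned long zero_with_xorl(void)
    {
        unsigned long r;
        asm("xorl %k0, %k0" : "=r"(r));   /* e.g. xorl %eax, %eax */
        return r;
    }

    static unsigned long zero_with_xorq(void)
    {
        unsigned long r;
        asm("xorq %q0, %q0" : "=r"(r));   /* same result, one byte longer */
        return r;
    }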

Absolute addressing for runtime code replacement in x86_64

有些话、适合烂在心里 Submitted on 2019-12-05 03:41:45
I'm currently using a code-replacement scheme in 32-bit mode where the code that is moved to another position reads variables and a class pointer. Since x86_64 does not support absolute addressing, I have trouble getting the correct addresses for the variables at the new position of the code. The problem in detail is that, because of RIP-relative addressing, the instruction pointer address is different than at compile time.

So is there a way to use absolute addressing in x86_64, or another way to get the addresses of variables that is not instruction-pointer relative? Something like:

    leaq variable(%%rax), %%rbx
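
One hedged sketch of non-RIP-relative addressing from GNU C inline assembly uses movabs, which embeds the symbol's full 64-bit address as an immediate, so the instruction keeps referring to the same variable no matter where the code bytes are copied. The names my_variable and absolute_address_of_my_variable are placeholders, and the sketch assumes a non-PIC/non-PIE build (e.g. -fno-pie -no-pie) so an absolute relocation in .text is acceptable:

    /* Minimal sketch (GNU C, AT&T inline asm), placeholder names. */
    long my_variable = 42;

    static void *absolute_address_of_my_variable(void)
    {
        void *p;
        /* 64-bit immediate holding the symbol's absolute address,
         * not a RIP-relative displacement. */
        asm("movabsq $my_variable, %0" : "=r"(p));
        return p;
    }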

Is it possible to run 16-bit code in an operating system that supports Intel IA-32e mode?

谁都会走 Submitted on 2019-12-05 03:29:44
In the Intel 64 & IA-32 Architectures manual, vol. 3A, Chapter 9, "Processor Management and Initialization", I found the following:

    Compatibility mode execution is selected on a code-segment basis. This mode allows legacy applications to coexist with 64-bit applications running in 64-bit mode. An operating system running in IA-32e mode can execute existing 16-bit and 32-bit applications by clearing their code-segment descriptor's CS.L bit to 0.

Does this mean that legacy 16-bit & 32-bit applications can coexist with 64-bit applications on an operating system running in IA-32e mode? But as I know

How many cycles math functions take on modern processors

白昼怎懂夜的黑 Submitted on 2019-12-05 03:26:21
Question: We know that modern processors execute instructions such as cosine and sine directly on the processor, as they have opcodes for them. My question is how many cycles these instructions normally take. Do they take constant time, or does it depend on the input parameters?

Answer 1: Talking about "cycles for an instruction" on modern processors became difficult quite a while ago. Processors these days contain multiple execution cores; their operation can overlap and execute out of order. A good example of

Nasm - Symbol `printf' causes overflow in R_X86_64_PC32 relocation

生来就可爱ヽ(ⅴ<●) Submitted on 2019-12-05 01:56:34
I am trying to create a simple program in NASM that should display the letter a. However, it is giving me a segfault and saying this:

    ./a.out: Symbol `printf' causes overflow in R_X86_64_PC32 relocation
    Segmentation fault (core dumped)

Basically, I am trying to move the value 0x61 (hex for the letter a) into memory address 1234, and then pass that as an argument to printf. Here is my exact code:

    extern printf

    section .text
    global main

    main:
        push rbp
        mov rax, 0
        mov qword [1234], 0x61   ; move 0x61 into address 1234
        mov rdi, qword [1234]    ; mov address 1234 into rdi
        call printf              ; should print the
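
Purely as a hedged illustration of the intended behavior (not a fix for the assembly itself), the same goal written in C looks like the following. The two conceptual differences from the snippet above are that the byte is stored in real, writable storage rather than at the arbitrary address 1234, and that printf is handed a pointer to a NUL-terminated string rather than the character value itself:

    #include <stdio.h>

    /* Illustrative C sketch: store 0x61 ('a') somewhere valid and pass
     * printf a pointer to a NUL-terminated string. */
    int main(void)
    {
        char buf[2];
        buf[0] = 0x61;   /* 'a' */
        buf[1] = '\0';
        printf("%s", buf);
        return 0;
    }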

32-bit pointers with the x86-64 ISA: why not?

馋奶兔 Submitted on 2019-12-05 01:47:24
The x86-64 instruction set adds more registers and other improvements to help streamline executable code. However, in many applications the increased pointer size is a burden. The extra, unused bytes in every pointer clog up the cache and might even overflow RAM. GCC, for example, builds with the -m32 flag, and I assume this is the reason.

It's possible to load a 32-bit value and treat it as a pointer. This doesn't necessitate extra instructions, just load/compute the 32 bits and load from the resulting address. The trick won't be portable, though, as platforms have different memory maps. On
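
A hedged C sketch of the "load a 32-bit value and treat it as a pointer" idea: the ptr32_t handle type and helper names are invented here, and the scheme is only valid under the assumption that every object it refers to lives below 4 GiB (for example, memory from an allocator that maps it there):

    #include <stdint.h>

    /* Hypothetical 32-bit "handle": stores only the low 32 bits of an address
     * and widens it back to a full pointer on use. Safe only if the pointed-to
     * objects are guaranteed to live below 4 GiB. */
    typedef uint32_t ptr32_t;

    static inline void *from_ptr32(ptr32_t h)
    {
        return (void *)(uintptr_t)h;      /* zero-extend to 64 bits */
    }

    static inline ptr32_t to_ptr32(void *p)
    {
        return (ptr32_t)(uintptr_t)p;     /* truncates: caller must ensure p < 4 GiB */
    }

The x32 ABI (gcc -mx32) is the toolchain-supported version of the same idea: the full x86-64 register set and instruction set, but with 32-bit pointers.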

Difference in ABI between x86_64 Linux functions and syscalls

人盡茶涼 Submitted on 2019-12-05 01:28:04
The x86_64 SysV ABI's function calling convention defines integer argument #4 to be passed in the rcx register. The Linux kernel syscall ABI, on the other hand, uses r10 for that same purpose. All other arguments are passed in the same registers for both functions and syscalls.

This leads to some strange things. Check out, for example, the implementation of mmap in glibc for the x32 platform (for which the same discrepancy exists):

    00432ce0 <__mmap>:
      432ce0: 49 89 ca             mov    %rcx,%r10
      432ce3: b8 09 00 00 40       mov    $0x40000009,%eax
      432ce8: 0f 05                syscall

So all registers are already in place, except we
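
A minimal sketch of how this shows up when issuing a raw 4-argument syscall from GNU C inline assembly (the wrapper name raw_syscall4 is invented for this example): arguments 1-3 go in rdi/rsi/rdx exactly as for a function call, but argument 4 must be placed in r10, and rcx/r11 are listed as clobbers because the SYSCALL instruction itself overwrites them, which is why the kernel cannot take an argument in rcx:

    /* Hedged sketch of a raw 4-argument Linux syscall from GNU C inline asm. */
    static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
    {
        long ret;
        register long r10_reg asm("r10") = a4;   /* arg 4 goes in r10, not rcx */
        asm volatile("syscall"
                     : "=a"(ret)                                /* return value in rax */
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10_reg)
                     : "rcx", "r11", "memory");                 /* clobbered by SYSCALL */
        return ret;
    }

The mov %rcx,%r10 in the glibc disassembly above is exactly this fix-up.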

Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

孤街醉人 Submitted on 2019-12-05 01:09:14
I'm trying to understand how to measure performance and decided to write a very simple program:

    section .text
    global _start

    _start:
        mov rax, 60
        syscall

And I ran the program with perf stat ./bin. The thing I was surprised by is that stalled-cycles-frontend was so high:

    0.038132      task-clock (msec)         #    0.148 CPUs utilized
           0      context-switches          #    0.000 K/sec
           0      cpu-migrations            #    0.000 K/sec
           2      page-faults               #    0.052 M/sec
     107,386      cycles                    #    2.816 GHz
      81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle
      47,654      instructions              #    0.44  insn per cycle
                                            #    1.70  stalled cycles per insn
       8,601      branches                  #  225.559 M