x86-64

Does a legitimate epilog need to include a dummy rsp adjustment even if not otherwise necessary?

余生颓废 submitted on 2019-12-06 08:26:58
The x86-64 Windows ABI has the concept of a legitimate epilog, which is a special type of function epilog that can be simulated during exception handling in order to restore the caller's context, as described here: If the RIP is within an epilog [when an exception occurs], then control is leaving the function, ... and the effects of the epilog must be continued to compute the context of the caller function. To determine if the RIP is within an epilog, the code stream from RIP on is examined. If that code stream can be matched to the trailing portion of a legitimate epilog, then it is in an

How does address operand affect performance and size of machine code?

此生再无相见时 submitted on 2019-12-06 08:00:16
Starting with 32-bit CPU mode, extended address operands are available on the x86 architecture. One can specify a base address, a displacement, an index register, and a scaling factor. For example, suppose we want to stride through a list of 32-bit integers (the first two from each element of an array of 32-byte-long data structures, with %rdi as the data index and %rbx as the base pointer):

    addq $8, %rdi                # skip eight values: advance index by 8
    movl (%rbx, %rdi, 4), %eax   # load data: pointer + scaled index
    movl 4(%rbx, %rdi, 4), %edx  # load data: pointer + scaled index + displacement

As far as I know, such complex addressing
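
For readers who think in C rather than assembly, the access pattern in the question can be sketched as below. This is only an illustration: the function name `sum_first_two` and its parameters are hypothetical, and a compiler is free to pick different addressing modes than the hand-written ones above.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the first two 32-bit values of every 8-value (32-byte) record.
 * `values` plays the role of the base pointer (%rbx in the question)
 * and `i` the role of the index register (%rdi). Hypothetical helper. */
int64_t sum_first_two(const int32_t *values, size_t value_count)
{
    int64_t total = 0;
    for (size_t i = 0; i + 1 < value_count; i += 8) {
        total += values[i];      /* address = values + i*4     */
        total += values[i + 1];  /* address = values + i*4 + 4 */
    }
    return total;
}
```

The two loads in the loop body correspond to the "pointer + scaled index" and "pointer + scaled index + displacement" forms, with a scale of 4 because each element is 4 bytes wide.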

How do I write the value in RAX to STDOUT in assembly?

最后都变了- submitted on 2019-12-06 07:18:30
Question: I can use the write syscall to print data in memory to STDOUT:

    ssize_t write(int fd, const void *buf, size_t count);

That is:

    movq $1, %rax                    # syscall number 1 = write
    movq $1, %rdi                    # fd 1 = stdout
    movq $address_of_variable, %rsi  # buf
    movq $5, %rdx                    # count
    syscall

But how can I print register values?

UPDATE

    .text
        call start
    start:
        movq $100, %rdi
        movq $10, %rsi
        call print_number
        ret
    buffer:
        .skip 64
    bufferend:
    # rdi = number
    # rsi = base
    print_number:
        leaq bufferend, %rcx
        movq %rdi, %rax
    1:
        xorq %rdx, %rdx
        divq %rsi
        add $'0', %dl
        cmp $'9', %dl
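
The UPDATE code is cut off here, but the visible part follows the usual repeated-division approach: divide by the base, turn each remainder into a character, and fill a buffer backwards from its end. A minimal C sketch of that idea follows. It assumes a base between 2 and 36 and a buffer of at least 65 bytes (64 binary digits plus a terminator); `format_u64` is a hypothetical name, not something from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert `value` to text in the given base, writing digits backwards
 * from the end of `buf` (like starting at `bufferend` in the assembly).
 * Returns a pointer to the first character of the NUL-terminated result.
 * Assumes 2 <= base <= 36 and size >= 65. Hypothetical helper. */
char *format_u64(uint64_t value, unsigned base, char *buf, size_t size)
{
    char *p = buf + size;          /* one past the end of the buffer */
    *--p = '\0';
    do {
        unsigned digit = (unsigned)(value % base);   /* remainder, as divq leaves in %rdx */
        *--p = (char)(digit < 10 ? '0' + digit : 'a' + (digit - 10));
        value /= base;                               /* quotient, as divq leaves in %rax */
    } while (value != 0);
    return p;
}
```

The returned pointer and the length up to the terminator are exactly the `buf` and `count` arguments that the write syscall shown above expects.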

How can I get the _GLOBAL_OFFSET_TABLE_ address in my program?

南笙酒味 submitted on 2019-12-06 06:43:51
Question: I want to get the address of _GLOBAL_OFFSET_TABLE_ in my program. One way is to use the nm command on Linux, perhaps redirecting the output to a file and parsing that file to get the address of _GLOBAL_OFFSET_TABLE_. However, that method seems quite inefficient. What are some more efficient ways of doing it?

Answer 1: This appears to work:

    #include <stdio.h>

    extern void *_GLOBAL_OFFSET_TABLE_;

    int main() {
        printf("_GLOBAL_OFFSET_TABLE = %p\n", (void *)&_GLOBAL_OFFSET_TABLE_);
        return 0;
    }

It gives:

    $ ./test

how to export a function in GAS assembler?

醉酒当歌 submitted on 2019-12-06 06:13:50
Hi, I have the following assembly code:

    .export __ls__11NSDOM_EncapFf
    .text
    __ls__11NSDOM_EncapFf:
        /* first load the symbolic constant */
        movq _IEEE_FP@GOTPCREL(%rip), %r8   /* %r8 is a scratch register */
        movq (%r8), %r9                     /* %r9 and %r11 are scratch registers */
        movl (%r9), %r11d
        /* second, see if it is zero and branch accordingly */
        test %r11d, %r11d
        /* zero call TNS procedure */
        /* non-zero call IEEE procedure */
        je ____ls__11NSDOM_EncapFf_tns      /* constant equals 0 */
        jmp ____ls__11NSDOM_EncapFf_ieee    /* constant not equal to 0 */
        ret

I compile the .s file to a .o file (compilation is fine), but when I

x86_64 - Self-modifying code performance

社会主义新天地 submitted on 2019-12-06 05:56:02
I am reading the Intel architecture documentation, vol. 3, section 8.1.3:

Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

So, if I respect the rules:

    (* OPTION 1 *)
    Store modified code (as data) into code segment;
    Jump to new code or an intermediate location;
    Execute new code;
    (* OPTION 2 *)
    Store modified code (as data) into code segment;
    Execute a serializing instruction; (* For example, CPUID instruction *)

Why does this function push RAX to the stack as the first operation?

佐手、 submitted on 2019-12-06 04:22:15
In the assembly of the C++ source below, why is RAX pushed to the stack? RAX, as I understand it from the ABI, could contain anything from the calling function. But we save it here and later move the stack pointer back by 8 bytes. So the RAX on the stack is, I think, only relevant for the std::__throw_bad_function_call() operation ... ?

The code:

    #include <functional>

    void f(std::function<void()> a) {
        a();
    }

Output from gcc.godbolt.org, using Clang 3.7.1 -O3:

    f(std::function<void ()>):              # @f(std::function<void ()>)
        push rax
        cmp qword ptr [rdi + 16], 0
        je .LBB0_1
        add rsp, 8
        jmp qword ptr [rdi +

Calling C function from x64 assembly with registers instead of stack

旧时模样 submitted on 2019-12-06 04:08:39
This answer puzzled me. According to the standard C calling conventions, the standard way to call C functions is to push arguments onto the stack and call the subroutine. That is clearly different from syscalls, where you set different registers with the appropriate arguments and then execute syscall. However, the answer mentioned above gives this GAS code:

    .global main
    .section .data
    hello: .asciz "Hello\n"
    .section .text
    main:
        movq $hello, %rdi
        movq $0, %rax
        call printf
        movq $0, %rax
        ret

which works with gcc hello.s -o hello. The part that calls printf is:

        movq $hello, %rdi
        movq $0, %rax
        call

OpenCL speed and floating-point precision

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 04:05:21
I have just started working with OpenCL. However, I have found some weird behavior in OpenCL that I can't understand. The source I built and tested was http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism . I have an ATI Radeon HD 4770 and an AMD FX 6200 3.8 GHz 6-core CPU. Speed: Firstly, the speed does not scale linearly with the maximum number of work-group items. I ran App Profiler to analyze the time spent during kernel execution. The result was a bit shocking: my GPU, which can only handle 256 work items per group, used 2.23008 milliseconds to calculate square of

Switch from 32bit mode to 64 bit (long mode) on 64bit linux

偶尔善良 submitted on 2019-12-06 03:45:41
Question: My program runs in 32-bit mode on an x86_64 CPU (64-bit OS, Ubuntu 8.04). Is it possible to switch to 64-bit mode (long mode) temporarily in user mode? If so, how? Background story: I'm writing a library that is linked with a 32-bit program, so it must be in 32-bit mode at start. However, I'd like to use faster x86_64 instructions for better performance. So I want to switch to 64-bit mode, do some pure computation (no OS interaction; no need for 64-bit addressing), and come back to 32-bit before returning to