x86-64

On x86-64, is the “movnti” or “movntdq” instruction atomic across a system crash?

Submitted by 故事扮演 on 2021-01-27 05:35:12
Question: When using persistent memory such as Intel Optane DCPMM, is it possible to see a partial result after reboot if the system crashes (power outage) during execution of a movnt instruction? Specifically, for: 4- or 8-byte movnti, which x86 guarantees atomic for other purposes; 16-byte SSE movntdq / movntps, which aren't guaranteed atomic but which in practice probably are on CPUs supporting persistent memory; 32-byte AVX vmovntdq / vmovntps; 64-byte AVX-512 vmovntdq / vmovntps full-line stores; bonus question: MOVDIR64B which has
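
For concreteness, here is a minimal sketch of the 8-byte and 16-byte cases using GCC/Clang intrinsics, assuming pmem points into a persistent-memory (e.g. DAX-mapped) region; the pointer name, the stored values, and the surrounding setup are assumptions, not taken from the question:

    #include <immintrin.h>
    #include <cstdint>

    // Non-temporal (movnti / movntdq) stores followed by sfence so the
    // write-combining buffers are pushed toward the persistence domain.
    void nt_store_example(void* pmem) {
        // 8-byte MOVNTI
        _mm_stream_si64(reinterpret_cast<long long*>(pmem), 0x1122334455667788LL);

        // 16-byte MOVNTDQ (destination must be 16-byte aligned)
        __m128i v = _mm_set1_epi64x(0x0102030405060708LL);
        _mm_stream_si128(reinterpret_cast<__m128i*>(static_cast<char*>(pmem) + 16), v);

        _mm_sfence();   // a crash before this point is exactly the case being asked about
    }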

Is clflush or clflushopt atomic across a system crash?

Submitted by 有些话、适合烂在心里 on 2021-01-27 05:27:33
Question: Commonly a cache line is 64 B, but the atomicity granule of non-volatile memory is 8 B. For example: x[1]=100; x[2]=100; clflush(x); x is cache-line aligned and is initially set to 0. The system crashes during clflush(). Is it possible that x[1]=0, x[2]=100 after reboot? Answer 1: Under the following assumptions: I assume that the code you've shown represents a sequence of x86 assembly instructions rather than actual C code that is yet to be compiled. I also assume that the code is being executed on a Cascade Lake processor
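
A compilable sketch of the scenario being described, assuming x is an array of 32-bit integers in a cache-line-aligned buffer backed by persistent memory (the element type and alignment are assumptions filled in for illustration):

    #include <immintrin.h>
    #include <cstdint>

    // Two 4-byte stores into the same 64-byte line, then an explicit flush.
    // The question is what a crash that lands inside the flush can leave
    // behind: both stores, one store, or neither.
    void update_and_flush(uint32_t* x) {    // x assumed 64-byte aligned
        x[1] = 100;
        x[2] = 100;
        _mm_clflush(x);   // flush the whole 64-byte line holding x[1] and x[2]
        _mm_sfence();     // order the flush before any later stores
    }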

Declaring an empty destructor prevents the compiler from calling memmove() for copying contiguous objects

Submitted by 别来无恙 on 2021-01-27 04:55:26
Question: Consider the following definition of Foo: struct Foo { uint64_t data; }; Now consider the following definition of Bar, which has the same data member as Foo but has an empty user-declared destructor: struct Bar { ~Bar(){} /* empty user-declared dtor */ uint64_t data; }; Using gcc 8.2 with -O2, the function copy_foo(): void copy_foo(const Foo* src, Foo* dst, size_t len) { std::copy(src, src + len, dst); } results in the following assembly code: copy_foo(Foo const*, Foo*, size_t): salq
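
A self-contained way to see which property changes between the two types (the static_asserts and copy_bar are additions for illustration, not from the question; compile with -std=c++17):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <type_traits>

    struct Foo { uint64_t data; };               // all special members trivial
    struct Bar { ~Bar() {} uint64_t data; };     // user-provided destructor

    // The user-provided destructor makes Bar non-trivially copyable, which is
    // the property library implementations typically key on before lowering
    // std::copy over contiguous objects to a memmove.
    static_assert(std::is_trivially_copyable_v<Foo>);
    static_assert(!std::is_trivially_copyable_v<Bar>);

    void copy_foo(const Foo* src, Foo* dst, std::size_t len) { std::copy(src, src + len, dst); }
    void copy_bar(const Bar* src, Bar* dst, std::size_t len) { std::copy(src, src + len, dst); }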

How does loop address alignment affect speed on Intel x86_64?

Submitted by 一世执手 on 2021-01-27 04:13:29
Question: I'm seeing a 15% performance degradation in the same C++ code compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it's faster than when it is at 0x415250. I'm running this on an Intel Core 2 Duo, using gcc 4.4.5 on x86_64 Ubuntu. Can anybody explain the cause of the slowdown and how I can force gcc to align the loop optimally? Here is the disassembly for both cases with profiler annotation: 415220 576 12.56%
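
One way to experiment with this, sketched below; the loop body is a stand-in and the flag value is just an example (-falign-loops=N is an existing GCC option that pads loop entry points):

    #include <cstddef>
    #include <cstdint>

    // Build the same file twice and compare the loop's start address and run time:
    //   g++ -O2                   align_test.cpp -o plain
    //   g++ -O2 -falign-loops=32  align_test.cpp -o aligned32
    // Inspect the resulting addresses with objdump -d.
    uint64_t sum(const uint32_t* data, std::size_t n) {
        uint64_t acc = 0;
        for (std::size_t i = 0; i < n; ++i)   // the "tiny main loop" stand-in
            acc += data[i];
        return acc;
    }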

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Submitted by 拈花ヽ惹草 on 2021-01-27 03:59:22
Question: Is there a good way of optimising this code (x86-64)? mov dword ptr [rsp], 0; mov dword ptr [rsp+4], 0, where the immediate values could be any values, not necessarily zero, but in this instance are always immediate constants. Is the original pair of stores even slow? Write-combining in the hardware and parallel operation of the μops might just make everything ridiculously fast anyway, so I'm wondering if there is even a problem to fix. I'm thinking of something like (don't know if the following
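
One source-level equivalent, sketched under the assumption that both halves are compile-time constants (the placeholder values 0x11111111 and 0x22222222 are made up):

    #include <cstdint>
    #include <cstring>

    // Pack the two 32-bit immediates into one 64-bit constant and issue a
    // single 8-byte store; compilers typically emit one mov (using a movabs
    // into a register when the constant doesn't fit a sign-extended imm32).
    void store_two_constants(void* p) {
        constexpr uint64_t both = (uint64_t{0x22222222} << 32) | 0x11111111u;  // high:low
        std::memcpy(p, &both, sizeof both);   // folds to a single 64-bit store at -O2
    }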

The x86 disassembly for C code generates: orq $0x0, (%rsp)

Submitted by 我们两清 on 2021-01-24 13:30:49
Question: I have written the following C code: it simply allocates an array of 1000000 integers and another integer, and sets the first integer of the array to 0. I compiled it using gcc -g test.c -o test -fno-stack-protector. It gives a very weird disassembly: apparently it keeps allocating 4096 bytes on the stack in a loop, ORing every 4096th byte with 0, and then, once it reaches 3997696 bytes, it further allocates 2184 bytes. It then proceeds to set the 4000000th byte (which was never
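
The excerpt omits the actual C source; below is a guess at the kind of program being described (the variable names and the printf are invented for illustration):

    #include <cstdio>

    // ~4 MB of automatic storage plus one extra int, with arr[0] zeroed, as
    // the question describes.  A frame this large is what the probing loop in
    // the disassembly walks over, 4096 bytes at a time.
    int main() {
        int arr[1000000];   // 4,000,000 bytes on the stack
        int extra = 1;      // the "another integer"
        arr[0] = 0;         // set the first integer of the array to 0
        std::printf("%d %d\n", arr[0], extra);
        return 0;
    }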

Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?

Submitted by 核能气质少年 on 2021-01-20 04:49:33
Question: Consider a simple instruction like mov RCX, RDI # 48 89 f9. The 48 is the REX prefix for x86_64; it is not an LCP. But consider adding an LCP (for alignment purposes): .byte 0x67 mov RCX, RDI # 67 48 89 f9. 0x67 is an address-size prefix, which in this case is applied to an instruction without memory operands. This instruction also has no immediates, and it doesn't use the F7 opcode (false LCP stalls; F7 would be TEST, NOT, NEG, MUL, IMUL, DIV + IDIV). Assume that it doesn't cross a 16-byte boundary either.
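
A minimal GCC/Clang extended-asm sketch that emits exactly that 4-byte encoding, so the prefixed and plain forms could be timed against each other; the function name and the measurement approach are assumptions, not part of the question:

    #include <cstdint>

    // Emits 67 48 89 f9: address-size prefix + mov rcx, rdi (Intel syntax).
    // Call this in a hot loop and compare against the plain 3-byte encoding.
    static inline uint64_t mov_with_0x67_prefix(uint64_t src) {
        uint64_t dst;
        asm volatile(
            ".byte 0x67\n\t"         // the prefix under discussion
            "mov %%rdi, %%rcx\n\t"   // 48 89 f9 (AT&T operand order)
            : "=c"(dst)              // result comes back in RCX
            : "D"(src));             // source pinned to RDI
        return dst;
    }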

Should using the MOV instruction to set SS to 0x0000 cause a #GP(0) fault in 64-bit mode?

Submitted by 天涯浪子 on 2021-01-19 21:18:53
Question: This question is inspired by a Reddit question in r/osdev, except that this one focuses on the SS register. One may say RTFM (the ISA entry for MOV), but when this question comes up it can get varying answers even among OS developers. Question: Should using the MOV instruction to set SS to 0x0000 cause a general protection fault #GP(0) in 64-bit mode? For example: if I am in 64-bit mode with a Current Privilege Level (CPL) of 0, should I expect to see a #GP(0) with this code snippet: NULL
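
The snippet in the excerpt is cut off; below is a hypothetical reconstruction of the kind of CPL-0 code being asked about (only meaningful inside a kernel running at ring 0, not in a user-space program):

    // Loads the null selector (0x0000) into SS via MOV, which is exactly the
    // operation whose #GP(0) behavior the question asks about.
    static inline void load_null_ss() {
        asm volatile(
            "xor %%eax, %%eax\n\t"   // AX = 0x0000, the null selector
            "mov %%ax, %%ss\n\t"     // MOV SS, AX -- does this fault at CPL 0?
            :
            :
            : "eax", "cc");
    }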