assembly

Which is generally faster to test for zero in x86 ASM: “TEST EAX, EAX” versus “TEST AL, AL”?

£可爱£侵袭症+ Submitted on 2021-01-27 06:28:10
Question: Which is generally faster to test the byte in AL for zero / non-zero: TEST EAX, EAX or TEST AL, AL? Assume a previous "MOVZX EAX, BYTE PTR [ESP+4]" instruction loaded a byte parameter with zero-extension to the remainder of EAX, preventing the combine-value penalty that I already know about. So AL=EAX and there are no partial-register penalties for reading EAX. Intuitively, just examining AL might let you think it's faster, but I'm betting there are more penalty issues to consider for byte access
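
A minimal NASM sketch of the two candidate sequences being compared; the label, the setz, and the ret are illustrative scaffolding, not part of the question:

is_byte_zero:                        ; hypothetical 32-bit cdecl function, for illustration only
    movzx   eax, byte [esp+4]        ; zero-extend the byte parameter into all of EAX
    test    eax, eax                 ; option 1: test the full 32-bit register
    ; test  al, al                   ; option 2: test only the low byte
    setz    al                       ; AL = 1 if the byte was zero, else 0
    ret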

How can the timer interrupt be 0x08 if the first 32 interrupts are reserved for exceptions?

倖福魔咒の Submitted on 2021-01-27 05:05:06
Question: I am developing an embedded program for an Intel i386, and I am trying to figure out how to use the hardware timer. I have read here (and in other places) that the timer interrupt is 0x08, but this page (and various other sources) says that the first 32 interrupts are reserved for exceptions, and that interrupt 0x08 specifically is the double fault. Which is true? How can I set up a timer interrupt handler, using either assembly or very low-level C with no operating system calls? I am developing a
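
For context (this is the standard bare-metal technique, not an answer quoted from the excerpt): vector 0x08 is only where the legacy 8259 PIC delivers IRQ0 by default, so free-standing code normally remaps the PIC so the timer arrives above the 32 reserved exception vectors. A hedged sketch, with vector bases 0x20/0x28 chosen purely for illustration:

remap_pic:              ; assumed free-standing 32-bit code, no OS
    mov al, 0x11
    out 0x20, al        ; ICW1 to master PIC: begin initialisation, ICW4 needed
    out 0xA0, al        ; ICW1 to slave PIC
    mov al, 0x20
    out 0x21, al        ; ICW2: master vector base 0x20, so IRQ0 (timer) -> INT 0x20
    mov al, 0x28
    out 0xA1, al        ; ICW2: slave vector base 0x28
    mov al, 0x04
    out 0x21, al        ; ICW3: slave attached on IRQ2
    mov al, 0x02
    out 0xA1, al        ; ICW3: slave cascade identity
    mov al, 0x01
    out 0x21, al        ; ICW4: 8086/88 mode
    out 0xA1, al
    ret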

How a Switch Statement Works

偶尔善良 Submitted on 2021-01-27 05:02:20
Question: How does a switch statement immediately drop to the correct location in memory? With nested if-statements, it has to perform comparisons against each one, but with a switch statement it goes directly to the correct case. How is this implemented? Answer 1: There are many different ways to compile a switch statement into machine code. Here are a few: The compiler can produce a series of tests, which is not so inefficient since only about log2(N) tests are enough to dispatch a value among N possible
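
Concretely, another strategy compilers commonly use is a jump table; the following hand-written NASM sketch (labels, case bodies, and calling convention invented for illustration) dispatches in O(1) with one bounds check and one indirect jump:

switch_dispatch:                      ; value to switch on passed at [esp+4]
    mov     eax, [esp+4]
    cmp     eax, 3
    ja      .default                  ; out-of-range values go to default
    jmp     [.table + eax*4]          ; single indirect jump through the table
.case0: mov     eax, 10
        ret
.case1: mov     eax, 20
        ret
.case2: mov     eax, 30
        ret
.case3: mov     eax, 40
        ret
.default:
        xor     eax, eax
        ret
.table: dd      .case0, .case1, .case2, .case3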

Declaring an empty destructor prevents the compiler from calling memmove() for copying contiguous objects

别来无恙 Submitted on 2021-01-27 04:55:26
Question: Consider the following definition of Foo: struct Foo { uint64_t data; }; Now, consider the following definition of Bar, which has the same data member as Foo but has an empty user-declared destructor: struct Bar { ~Bar(){} // <-- empty user-declared dtor uint64_t data; }; Using gcc 8.2 with -O2, the function copy_foo(): void copy_foo(const Foo* src, Foo* dst, size_t len) { std::copy(src, src + len, dst); } results in the following assembly code: copy_foo(Foo const*, Foo*, size_t): salq
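
To make the contrast concrete without reproducing the truncated gcc 8.2 output above, here is a hand-written x86-64 sketch (NASM, SysV ABI) of the two shapes of code involved: a bulk memmove tail call for the trivially copyable Foo versus an element-by-element loop once Bar's user-declared destructor is in play. The labels and exact instruction choices are invented for illustration, not taken from the compiler:

extern memmove
copy_foo_shape:                   ; rdi = src, rsi = dst, rdx = len
    mov     rax, rdi
    shl     rdx, 3                ; byte count = len * sizeof(Foo) = len * 8
    mov     rdi, rsi              ; memmove(dst, src, bytes)
    mov     rsi, rax
    jmp     memmove               ; one bulk copy for the whole range

copy_bar_shape:                   ; same arguments, but copied one element at a time
    test    rdx, rdx
    je      .done
    xor     ecx, ecx
.loop:
    mov     rax, [rdi + rcx*8]    ; load one Bar::data
    mov     [rsi + rcx*8], rax    ; store it
    inc     rcx
    cmp     rcx, rdx
    jne     .loop
.done:
    ret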

Why doesn't Ice Lake have MOVDIRx like Tremont? Do they already have better ones?

杀马特。学长 韩版系。学妹 Submitted on 2021-01-27 04:46:49
Question: I notice that Intel Tremont has the direct-store instructions MOVDIRI and MOVDIR64B (the latter a 64-byte store). These guarantee that the write to memory is atomic, but they don't guarantee load atomicity. Moreover, the write is weakly ordered, so a fence immediately afterwards may be needed. I find no MOVDIRx in Ice Lake. Why doesn't Ice Lake need instructions like MOVDIRx? (At the bottom of page 15) Intel® Architecture Instruction Set Extensions and Future Features Programming Reference https://software.intel.com/sites
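
For context, a minimal hedged sketch of how one of these direct stores is paired with an explicit fence, as the weak ordering described above requires; the register choice is arbitrary, and the snippet assumes an assembler that knows the MOVDIRI mnemonic and a CPU that advertises it in CPUID:

direct_store32:               ; rdi = destination address, esi = value (illustrative)
    movdiri [rdi], esi        ; direct store: write-combining semantics, weakly ordered
    sfence                    ; order it before later stores become globally visible
    ret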

Difference between MOV r/m8,r8 and MOV r8,r/m8

荒凉一梦 Submitted on 2021-01-27 04:39:44
Question: Looking at the Intel instruction set reference, I found this: 1) 88 /r MOV r/m8,r8 2) 8A /r MOV r8,r/m8. When I write a line like this in NASM and assemble it with the listing option: mov al, bl I get this in the listing: 88D8 mov al, bl So obviously NASM chose the first of the two instructions above, but isn't the second one an option too? If so, on what basis did NASM choose the first? Answer 1: These two encodings exist because a ModR/M byte can only encode one memory operand. So to allow
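
A small NASM fragment that makes the answer's point visible: the same architectural operation, mov al, bl, written once so the assembler picks the 88 /r form and once with the alternative 8A /r encoding emitted as raw bytes:

    mov al, bl          ; NASM emits 88 D8 (opcode 88 /r: BL in the reg field, AL in the r/m field)
    db  0x8A, 0xC3      ; the same mov al, bl via opcode 8A /r: AL in the reg field, BL in the r/m field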

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

拈花ヽ惹草 Submitted on 2021-01-27 03:59:22
Question: Is there a good way of optimising this code (x86-64)? mov dword ptr [rsp], 0; mov dword ptr [rsp+4], 0 where the immediate values could be any values, not necessarily zero, but in this instance are always immediate constants. Is the original pair of stores even slow? Write-combining in the hardware and parallel execution of the µops might just make everything ridiculously fast anyway? I'm wondering whether there is even a problem to fix. I'm thinking of something like (don't know if the following
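
One commonly suggested rewrite is to merge the pair into a single 8-byte store; the following NASM sketch is illustrative only and assumes rax is free as a scratch register for the general, non-zero case:

    mov qword [rsp], 0              ; zero case: one qword store covers [rsp] and [rsp+4]

    mov rax, 0x1111111122222222     ; general case: pack the two 32-bit immediates into one constant
    mov [rsp], rax                  ; [rsp] = 0x22222222, [rsp+4] = 0x11111111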

ARM prefetch workaround

烂漫一生 Submitted on 2021-01-27 03:47:34
Question: I have a situation where some of the address space is sensitive, in that if you read it you crash, as there is nothing there to respond to that address. pop {r3,pc} bx r0 0: e8bd8008 pop {r3, pc} 4: e12fff10 bx r0 8: bd08 pop {r3, pc} a: 4700 bx r0 The bx was not created by the compiler as an instruction; instead it is the result of a 32-bit constant that didn't fit as an immediate in a single instruction, so a PC-relative load is set up. This is basically the literal pool. And it happens to have bits

Is there any way to get correct rounding with the i387 fsqrt instruction?

天涯浪子 Submitted on 2021-01-27 02:35:48
Question: Is there any way to get correct rounding with the i387 fsqrt instruction?... ... aside from changing the precision mode in the x87 control word - I know that's possible, but it's not a reasonable solution because it has nasty reentrancy-type issues where the precision mode will be wrong if the sqrt operation is interrupted. The issue I'm dealing with is as follows: the x87 fsqrt opcode performs a correctly-rounded (per IEEE 754) square root operation in the precision of the FPU registers,
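
A short NASM illustration of where the double rounding described above enters: fsqrt rounds once to the register precision, and the store to a 64-bit double rounds again. The function name and calling convention are assumptions for the sake of the example:

sqrt_double:                      ; hypothetical 32-bit cdecl double sqrt_double(double x)
    fld     qword [esp+4]         ; load x
    fsqrt                         ; first rounding: to the FPU's current (80-bit) precision
    sub     esp, 8
    fstp    qword [esp]           ; second rounding: down to 64-bit double, where double rounding can bite
    fld     qword [esp]           ; reload the doubly-rounded value
    add     esp, 8
    ret                           ; result returned in st(0)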