replace inline assembly tailcall function epilogue with Intrinsics for x86/x64 msvc

老子叫甜甜 submitted on 2019-12-24 11:35:20

Question


I took over an inactive project and have already fixed a lot in it, but I can't get an intrinsics-based replacement to work correctly for the inline assembly it uses, which MSVC does not support when targeting x64.

#define XCALL(uAddr)  \
__asm { mov esp, ebp }   \
__asm { pop ebp }        \
__asm { mov eax, uAddr } \
__asm { jmp eax }

Use cases:

static oCMOB * CreateNewInstance() {
    XCALL(0x00718590);
}

int Copy(class zSTRING const &, enum zTSTR_KIND const &) {
    XCALL(0x0046C2D0);
}

void TrimLeft(char) {
    XCALL(0x0046C630);
}

Answer 1:


This snippet goes at the bottom of a function (which can't inline, and must be compiled with ebp as a frame pointer, and no other registers that need restoring). It looks quite brittle, or else it's only useful in cases where you didn't need inline asm at all.

Instead of returning, it jumps to uAddr, which is equivalent to making a tailcall.
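
As a rough illustration of what "equivalent to a tailcall" means in plain C++, here is a minimal sketch. The names real_target, target and wrapper are hypothetical, and real_target is only a stand-in so the example can run anywhere; in the project it would be code at a fixed address in the original binary.

```cpp
#include <cassert>

// Hypothetical stand-in for the routine that lives at a fixed address in the
// original binary, defined here only so the sketch is runnable.
static int real_target(int x) { return x * 2 + 1; }

// Typed pointer to the target. In the real project this would hold the
// absolute address instead of the address of a local function.
static int (*const target)(int) = &real_target;

// Forwarding wrapper: the call is the last thing the function does and its
// result is returned unchanged, so an optimizing compiler is free to emit
// `jmp` instead of `call` + `ret`. That is the effect the XCALL macro forces
// by hand.
int wrapper(int x) {
    return target(x);
}

// Small usage check.
void demo() {
    assert(wrapper(3) == 7);
}
```

Whether the compiler actually emits a jmp depends on optimization level and calling convention; the C++ source only makes that transformation possible.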

There aren't intrinsics for arbitrary jumps or manipulation of the stack. If you need that, you're out of luck. It doesn't make sense to ask about this snippet by itself, only with enough context to see how it's being used. For example, does it matter which return address is on the stack, or is it ok for it to compile to call/ret instead of jmp to that address? (See the first version of this answer for a simple example of using it as a function pointer.)


From your update, your use-cases are just a very clunky way to make wrappers for absolute function pointers.

We can instead define static const function pointers of the right types, so no wrapper is needed and the compiler can call directly from wherever you use these. static const is how we let the compiler know it can fully inline the function pointers and doesn't need to store them anywhere as data if it doesn't want to, just like a normal static const int xyz = 2;

struct oCMOB;
class zSTRING;
enum zTSTR_KIND { a, b, c };  // an unscoped enum without a fixed underlying type can't be forward-declared

// C syntax
//static oCMOB* (*const CreateNewInstance)() = (oCMOB *(*const)())0x00718590;

// C++11
static const auto CreateNewInstance = reinterpret_cast<oCMOB *(*)()>(0x00718590);
// passing an enum by const-reference is dumb.  By value is more efficient for integer types
static const auto Copy = reinterpret_cast<int (*)(class zSTRING const &, enum zTSTR_KIND const &)>(0x0046C2D0);
static const auto TrimLeft = reinterpret_cast<void (*)(char)> (0x0046C630);

void foo() {
    oCMOB *inst = CreateNewInstance();
    (void)inst; // silence unused warning

    zSTRING *dummy = nullptr;  // work around instantiating an incomplete type
    int result = Copy(*dummy, c);
    (void) result;

    TrimLeft('a');
}

This compiles just fine with x86-64 and 32-bit x86 MSVC, and with gcc/clang for 32 and 64-bit, on the Godbolt compiler explorer (and also for non-x86 architectures). This is the 32-bit asm output from MSVC, so you can compare it with what you get for your nasty wrapper functions. You can see that it has basically inlined the useful part (mov eax, uAddr / jmp or call) into the caller.

;; x86 MSVC -O3
$T1 = -4                                                ; size = 4
?foo@@YAXXZ PROC                                        ; foo
        push    ecx
        mov     eax, 7439760                          ; 00718590H
        call    eax

        lea     eax, DWORD PTR $T1[esp+4]
        mov     DWORD PTR $T1[esp+4], 2       ; the by-reference enum
        push    eax
        push    0                             ; the dummy nullptr
        mov     eax, 4637392                          ; 0046c2d0H
        call    eax

        push    97                                  ; 00000061H
        mov     eax, 4638256                          ; 0046c630H
        call    eax

        add     esp, 16                             ; 00000010H
        ret     0
?foo@@YAXXZ ENDP

For repeated calls to the same function, the compiler would keep the function pointer in a call-preserved register.
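
If there are many such addresses, the cast can be wrapped in a small helper. This is only a sketch: function_at, add_one and call_through_address are hypothetical names, and the demo round-trips the address of a local function through an integer so it can run anywhere, rather than calling one of the hard-coded addresses from the question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: turn a numeric address into a typed function pointer.
// Fn is a function type such as int(int).
template <typename Fn>
Fn* function_at(std::uintptr_t address) {
    return reinterpret_cast<Fn*>(address);
}

// Stand-in target so the sketch is runnable; the real targets live in the
// original binary and are not defined in this source.
static int add_one(int x) { return x + 1; }

int call_through_address(int x) {
    // Converting a real function's address to an integer and back yields the
    // original pointer, so the call below is well-defined.
    const auto address = reinterpret_cast<std::uintptr_t>(&add_one);
    const auto fn = function_at<int(int)>(address);
    return fn(x);
}

// Small usage check.
void demo_function_at() {
    assert(call_through_address(41) == 42);
}
```

With a hard-coded address the use would look like static const auto TrimLeft = function_at<void(char)>(0x0046C630); which is only meaningful inside the process that actually has that code mapped at that address.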


For some reason even with 32-bit position-dependent code, we don't get a direct call rel32. The linker can calculate the relative offset from the call-site to the absolute target at link time, so there's no reason for the compiler to use a register-indirect call.

When the compiler hasn't been told to create position-independent code, reaching an absolute address with a jump or call that is relative to the code is a useful optimization in this case.

In 32-bit code, every possible destination address is in range from every possible source address, but in 64-bit it's harder. In 32-bit mode, clang does spot this optimization! But even in 32-bit mode, MSVC and gcc miss it.
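
The range argument can be checked with a bit of arithmetic. The sketch below assumes the standard 5-byte call rel32 encoding (opcode E8 plus a 4-byte displacement counted from the end of the instruction); the helper names and the call-site addresses are made up for illustration, only 0x718590 comes from the question.

```cpp
#include <cstdint>

// `call rel32` is 5 bytes long: opcode E8 followed by a 4-byte displacement.
constexpr std::uint64_t kCallLength = 5;

// The displacement is measured from the end of the call instruction.
constexpr std::int64_t displacement(std::uint64_t call_site, std::uint64_t target) {
    return static_cast<std::int64_t>(target - (call_site + kCallLength));
}

// 64-bit mode: the displacement is sign-extended from 32 bits, so the target
// must be within roughly +-2 GiB of the call site.
constexpr bool reachable_in_64bit_mode(std::uint64_t call_site, std::uint64_t target) {
    return displacement(call_site, target) >= INT32_MIN &&
           displacement(call_site, target) <= INT32_MAX;
}

// 32-bit mode: address arithmetic wraps modulo 2^32, so some 32-bit
// displacement reaches any target from any call site.
constexpr std::uint32_t rel32_in_32bit_mode(std::uint32_t call_site, std::uint32_t target) {
    return static_cast<std::uint32_t>(target - (call_site + 5u));
}

// Hypothetical call sites; 0x718590 is the target from the question.
static_assert(reachable_in_64bit_mode(0x400000, 0x718590), "low non-PIE code is in range");
static_assert(!reachable_in_64bit_mode(0x7f0000000000, 0x718590), "a high mapping is out of range");
static_assert(rel32_in_32bit_mode(0xFFFFFFF0u, 0x10u) == 0x1Bu, "wrap-around still reaches the target");
```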

I played around with some stuff with gcc/clang:

// don't use
oCMOB * CreateNewInstance(void) asm("0x00718590");

Kind of works, but only as a total hack. Gcc just uses that string as if it were a symbol, so it feeds call 0x00718590 to the assembler, which handles it correctly (generating an absolute relocation which links just fine in a non-PIE executable). But with -fPIE, it emits 0x00718590@GOTPCREL as a symbol name, so we're screwed.

Of course, in 64-bit mode a PIE executable or library will be out of range of that absolute address so only non-PIE makes sense anyway.


Another idea was to define the symbol in asm with an absolute address, and provide a prototype that would get gcc to only use it directly, without @PLT or going through the GOT. (I maybe could have done that for the func() asm("0x..."); hack, too, using hidden visibility.)

I only realized after hacking this up with the "hidden" attribute that this is useless in position-independent code, so you can't use this in a shared library or PIE executable anyway.

extern "C" is not necessary, but means I didn't have to mess with name mangling in the inline asm.

#ifdef __GNUC__

extern "C" {
    // hidden visibility means that even in a PIE executable, or shared lib,
    // calls will go *directly* to that address, not via the PLT or GOT.
    oCMOB * CNI(void) __attribute__((__visibility__("hidden")));
}
//asm("CNI = 0x718590");  // set the address of a symbol, like `org 0x71... / CNI:`
asm(".set CNI, 0x718590");  // alternate syntax for the same thing


void *test() {
    CNI();    // works

    return (void*)CNI;  // gcc: RIP+0x718590 instead of the relative displacement needed to reach it?
    // clang appears to work
}
#endif

Disassembly of the compiled+linked gcc output for test, from Godbolt, using the binary output to see how it assembled+linked:

 # gcc -O3  (non-PIE).  Clang makes pretty much the same code, with a direct call and mov imm.
 sub    rsp,0x8
 call   718590 <CNI>
 mov    eax,0x718590
 add    rsp,0x8
 ret    

With -fPIE, gcc+gas emits lea rax,[rip+0x718590] # b18ab0 <CNI+0x400520>, i.e. it uses the absolute address as an offset from RIP, instead of subtracting. I guess that's because gcc literally emits lea CNI(%rip),%rax, and we've defined CNI as an assemble-time symbol with that numeric value. Oops. So it's not quite like a label with that address like you'd get with .org 0x718590; CNI:.

But since we can only use a rel32 call in non-PIE executables, this is ok unless you link with -no-pie but forget to compile with -fno-pie, in which case you're screwed. :/

Providing a separate object file with the symbol definition might have worked.

Clang appears to do exactly what we want, though, even with -fPIE, with its built-in assembler. This machine code could only have linked with -fno-pie (the default on Godbolt, not the default on many distros.)

 # disassembly of clang -fPIE machine-code output for test()
 push   rax
 call   718590 <CNI>
 lea    rax,[rip+0x3180b3]        # 718590 <CNI>
 pop    rcx
 ret    

So this is actually safe (but sub-optimal because lea rel32 is worse than mov imm32.) With -m32 -fPIE, it doesn't even assemble.
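
The difference between the gcc+gas and clang lea encodings comes down to simple arithmetic on the numbers in the two disassembly comments. The sketch below assumes the usual rule that a RIP-relative operand is resolved against the address of the next instruction; the next-instruction addresses (0x400520 and 0x4004dd) are inferred from those comments, not shown directly in the listings, and the helper names are hypothetical.

```cpp
#include <cstdint>

// RIP-relative addressing: effective address = address of the next
// instruction + sign-extended 32-bit displacement.
constexpr std::uint64_t rip_relative_target(std::uint64_t next_instruction, std::int32_t disp) {
    return next_instruction + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}

// Displacement an assembler has to emit so that the operand resolves to `target`.
constexpr std::int64_t required_disp(std::uint64_t next_instruction, std::uint64_t target) {
    return static_cast<std::int64_t>(target - next_instruction);
}

constexpr std::uint64_t kCNI = 0x718590;  // absolute address assigned with .set

// gcc+gas with -fPIE: the symbol's value is used as the displacement itself,
// so the result is CNI + 0x400520 = 0xb18ab0 instead of CNI.
static_assert(rip_relative_target(0x400520, 0x718590) == 0xb18ab0, "matches the gcc comment");

// clang: the displacement 0x3180b3 is target minus next-instruction address,
// so the operand really resolves to CNI.
static_assert(required_disp(0x4004dd, kCNI) == 0x3180b3, "displacement clang emitted");
static_assert(rip_relative_target(0x4004dd, 0x3180b3) == kCNI, "matches the clang comment");
```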



Source: https://stackoverflow.com/questions/52010509/replace-inline-assembly-tailcall-function-epilogue-with-intrinsics-for-x86-x64-m
