How can I write self-modifying code that runs efficiently on modern x64 processors?

深忆病人 2021-02-02 00:02

I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on-the-fly. Currently a lot of time is spe

4 answers
  •  生来不讨喜
    2021-02-02 00:43

    This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".

    That is, you keep a (global) function pointer around that redirects to a writable/executable mapped section of memory, into which you then insert the function calls you wish to make.
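
    A minimal sketch of that idea, assuming a POSIX system (mmap/mprotect) and an x86-64 build. The function name run_generated_constant is made up for illustration. Instead of keeping the mapping writable and executable at the same time, this version writes first and then flips the page to read+execute, because some systems refuse W+X mappings; to patch again later you would flip it back to writable.

    ```c
    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on strict glibc builds */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef int (*gen_fn)(void);

    /*
     * Emit `mov $value, %eax ; ret` into fresh memory, call it through a
     * function pointer, and store the result in *out.
     * Returns false when not built for x86-64 or when the mapping fails.
     */
    bool run_generated_constant(int32_t value, int32_t *out)
    {
    #if defined(__x86_64__)
        uint8_t code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };   /* B8 imm32 ; C3 */
        memcpy(&code[1], &value, sizeof value);

        void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
            return false;
        memcpy(mem, code, sizeof code);

        /* Make the page executable only after the code has been written. */
        if (mprotect(mem, 4096, PROT_READ | PROT_EXEC) != 0) {
            munmap(mem, 4096);
            return false;
        }

        gen_fn fn = (gen_fn)mem;     /* object-to-function cast, fine on POSIX */
        *out = fn();
        munmap(mem, 4096);
        return true;
    #else
        (void)value;
        (void)out;
        return false;
    #endif
    }
    ```

    In a real trampoline you would keep the mapping (and the function pointer) alive instead of unmapping it right after the first call.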

    The main difficulty with this is that call is IP-relative (as are most jmp forms), so you have to calculate the offset between the memory location of your trampoline and the "target funcs". That in itself is simple enough, but in 64bit code the relative displacement of call only covers a range of +-2GB. If a target may be further away than that, it becomes more complex: you need to call through a linkage table.
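
    The offset calculation and the +-2GB check can be sketched like this. The helper name call_rel32_disp is made up for illustration; the only facts used are that a near call is 5 bytes (opcode 0xE8 plus a 32-bit displacement) and that the displacement is relative to the address of the next instruction.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Size of a near `call rel32`: opcode 0xE8 + 4-byte displacement. */
    #define CALL_REL32_SIZE 5

    /*
     * Compute the displacement for a call located at call_addr that should
     * reach target. Returns false if the target is out of rel32 range, in
     * which case an indirect jump through a table is needed instead.
     */
    bool call_rel32_disp(uint64_t call_addr, uint64_t target, int32_t *disp)
    {
        /* The CPU adds the displacement to the address AFTER the call. */
        int64_t delta = (int64_t)(target - (call_addr + CALL_REL32_SIZE));
        if (delta < INT32_MIN || delta > INT32_MAX)
            return false;
        *disp = (int32_t)delta;
        return true;
    }
    ```

    If this returns false for a target, that is the case where the direct call has to go via a slot holding the full 64bit address.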

    So you'd essentially create code like (/me severely UN*X biased, hence AT&T assembly, and some references to ELF-isms):

    .Lstart_of_modifyable_section:
    callq 0f
    callq 1f
    callq 2f
    callq 3f
    callq 4f
    ....
    ret
    .align 32
    0:        jmpq tgt0
    .align 32
    1:        jmpq tgt1
    .align 32
    2:        jmpq tgt2
    .align 32
    3:        jmpq tgt3
    .align 32
    4:        jmpq tgt4
    .align 32
    ...
    

    This can be created at compile time (just make a writable text section), or dynamically at runtime.

    You then, at runtime, patch the jump targets. That's similar to how the .plt ELF Section (PLT = procedure linkage table) works - just that there, it's the dynamic linker which patches the jmp slots, while in your case, you do that yourself.

    If you go for all runtime, then a table like the above is easy to create even from C/C++; start with data structures like:

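    #include <stdint.h>

    /* GCC/Clang attribute syntax; "packed" keeps each call entry at 5 bytes. */
    struct __attribute__((packed)) call_tbl_entry {
        uint8_t call_opcode;          /* 0xE8: call rel32 */
        int32_t call_displacement;
    };

    /* One 32-byte, 32-byte-aligned slot per jump target. */
    union __attribute__((aligned(32))) jmp_tbl_entry {
        uint8_t cacheline[32];
        struct __attribute__((packed)) {
            uint8_t  jmp_opcode[6];   /* FF 25 00 00 00 00: jmp *0(%rip), the 64bit absolute jump */
            uint64_t jmp_tgtaddress;  /* target address, read by the jmp above */
        } tbl;
    };

    struct mytbl {
        struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
        uint8_t ret_opcode;           /* 0xC3: ret */
        union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
    };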
    

    The only critical and somewhat system-dependent thing here is the "packed" nature of this, which one needs to tell the compiler about (i.e. not to pad the call array out), and that one should cacheline-align the jump table.

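    You need to make calltbl[i].call_displacement = (int32_t)((uint8_t *)&jmptbl[i] - (uint8_t *)&calltbl[i+1]), i.e. the byte offset from the end of the call instruction to its jump slot (note the casts: subtracting pointers of two different types directly does not compile). Set each call_opcode to 0xE8, initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)) and then just fill the fields with the jump opcode and target address as you need.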
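
    Put together, the initialization could look like the sketch below. It assumes GCC/Clang attribute syntax, NUM_CALL_SLOTS = 4 picked arbitrarily, and jmp *0(%rip) (FF 25 00 00 00 00 followed by the 8-byte address) as the absolute jump; mytbl_init and mytbl_set_target are made-up names. It only builds the bytes: to run them, the table has to live in executable memory.

    ```c
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_CALL_SLOTS 4              /* assumed value, illustration only */

    struct __attribute__((packed)) call_tbl_entry {
        uint8_t call_opcode;              /* 0xE8: call rel32 */
        int32_t call_displacement;
    };

    union __attribute__((aligned(32))) jmp_tbl_entry {
        uint8_t cacheline[32];
        struct __attribute__((packed)) {
            uint8_t  jmp_opcode[6];       /* FF 25 00 00 00 00: jmp *0(%rip) */
            uint64_t jmp_tgtaddress;      /* absolute target read by the jmp */
        } tbl;
    };

    struct mytbl {
        struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
        uint8_t ret_opcode;
        union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
    };

    void mytbl_init(struct mytbl *t)
    {
        /* Everything starts as RET, so an unpatched slot returns at once. */
        memset(t, 0xC3, sizeof *t);
        for (size_t i = 0; i < NUM_CALL_SLOTS; i++) {
            t->calltbl[i].call_opcode = 0xE8;
            /* Byte distance from the end of call i to the start of slot i. */
            t->calltbl[i].call_displacement =
                (int32_t)((uint8_t *)&t->jmptbl[i] - (uint8_t *)&t->calltbl[i + 1]);
        }
        t->ret_opcode = 0xC3;
    }

    void mytbl_set_target(struct mytbl *t, size_t i, uint64_t target)
    {
        static const uint8_t jmp_rip0[6] = { 0xFF, 0x25, 0x00, 0x00, 0x00, 0x00 };
        memcpy(t->jmptbl[i].tbl.jmp_opcode, jmp_rip0, sizeof jmp_rip0);
        t->jmptbl[i].tbl.jmp_tgtaddress = target;
    }
    ```

    With these sizes the call array occupies 20 bytes, the ret sits at offset 20, and the aligned jump table starts at offset 32, so the displacements come out as 27, 54, 81 and 108.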
