I'm building an interpreter, and since I'm aiming for raw speed this time, every clock cycle matters in this (raw) case.
Do you have any experience or information on how std::vector performs here compared to a plain global array?
For decent results, use std::vector
as the backing storage and take a pointer to its first element before your main loop or whatever:
#include <cstdint>
#include <vector>

std::vector<uint8_t> mem_buf;
// ... set up the interpreter's memory ...

// Cache the base pointer once, outside the dispatch loop.
// (mem_buf.data() is the C++11 spelling of &mem_buf[0].)
uint8_t *mem = &mem_buf[0];
for (;;) {
    switch (mem[pc]) {
    // ... one case per opcode ...
    }
}
This avoids any issues with over-helpful implementations that perform bounds checking in operator[], and it makes single-stepping in a debugger easier, because you won't keep stepping into operator[] whenever you step into an expression such as mem_buf[pc] later in the code.
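To make that concrete (a minimal sketch; whether operator[] actually checks depends on the implementation and build flags, e.g. MSVC's checked-iterator debug runtime or libstdc++'s _GLIBCXX_DEBUG):

uint8_t a = mem_buf[pc]; // may range-check, and is a call you can step into
uint8_t b = mem[pc];     // plain load: no check, nothing to step into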
If each instruction does enough work, and the code is varied enough, this should be quicker than using a global array, though only by some negligible amount. (If the difference is noticeable, the opcodes are doing too little work per dispatch and need to be made more complicated; see the sketch below.)
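As a sketch of what "more complicated" means here (hypothetical opcodes and state; the idea is to fold several cheap operations into one so the dispatch overhead is amortised):

#include <cstdint>

enum : uint8_t { OP_INC, OP_ADD_N };

// Hypothetical fragment: acc is an accumulator, pc the program counter.
void step(const uint8_t *mem, uint32_t &pc, uint32_t &acc) {
    switch (mem[pc]) {
    case OP_INC:            // one unit of work per dispatch
        ++acc;
        pc += 1;
        break;
    case OP_ADD_N:          // a fused opcode: several units per dispatch
        acc += mem[pc + 1]; // operand byte packed after the opcode
        pc += 2;
        break;
    }
}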
Compared to using a global array, on x86 the instructions for this sort of dispatch should be more concise (no 32-bit displacement fields anywhere), and on more RISC-like targets fewer instructions should be generated (no TOC lookups or awkward 32-bit constants), because the commonly-used values are all in the stack frame, reachable via small offsets.
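You can see this for yourself by compiling a hypothetical pair of accessors with something like g++ -O2 -S and comparing the output:

#include <cstdint>

uint8_t g_mem[65536];

// Indexing the global: the array's address must be encoded somewhere in
// the instruction stream (a 32-bit displacement on x86, or address
// materialisation on RISC-like targets).
uint8_t load_global(uint32_t pc) { return g_mem[pc]; }

// Indexing through a pointer held in a parameter/local: the base is
// already in a register, so the load needs no embedded address.
uint8_t load_local(const uint8_t *mem, uint32_t pc) { return mem[pc]; }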
I'm not really convinced that optimizing an interpreter's dispatch loop in this way will produce a good return on time invested; if dispatch cost is an issue, the instructions should really be made to do more work each. But I suppose it shouldn't take long to try out a few different approaches and measure the difference. As always, in the event of unexpected behaviour, consult the generated assembly language (and, on x86, the machine code, since instruction length can be a factor) to check for obvious inefficiencies.
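A minimal timing harness for that kind of measurement might look like this (a sketch only; the workload is a stand-in for your dispatch loop, and you must keep the result observable so the optimiser can't delete the loop):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> mem_buf(1 << 20, 1); // hypothetical "program"
    uint8_t *mem = &mem_buf[0];

    auto t0 = std::chrono::steady_clock::now();
    uint64_t acc = 0;
    for (int run = 0; run < 1000; ++run)
        for (size_t pc = 0; pc < mem_buf.size(); ++pc)
            acc += mem[pc]; // stand-in for the interpreter's inner loop
    auto t1 = std::chrono::steady_clock::now();

    // Printing acc keeps the work observable.
    std::printf("%llu in %lld us\n",
                (unsigned long long)acc,
                (long long)std::chrono::duration_cast<
                    std::chrono::microseconds>(t1 - t0).count());
}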