What techniques promote efficient opcode dispatch to make a fast interpreter? Are there some techniques that only work well on modern hardware, and others that don't work well on it?
Indirect threading is a dispatch strategy in which each opcode implementation ends with its own indirect jump
to the next opcode's handler. The patch to the Python interpreter looks something like this:
    add:
        result = a + b;
        goto *opcode_targets[*next_instruction++];
opcode_targets maps each instruction in the language's bytecode to the address in memory of that opcode's implementation. This is faster because the processor's branch predictor can make a separate prediction for each bytecode, in contrast to a switch statement, which funnels every dispatch through a single indirect branch.
The compiler must support computed goto (the "labels as values" extension) for this to work, which in practice means GCC or Clang.
Direct threading is similar, but the bytecode array of opcodes is replaced with an array of pointers to the opcode implementations, so dispatch needs no table lookup:
    goto *next_opcode_target++;
These techniques are only useful because modern processors are pipelined and must flush the pipeline (which is slow) on a mispredicted branch. Processor designers added branch prediction to avoid flushing the pipeline so often, but branch prediction only helps for branches that tend to take the same path, or a pattern of paths the predictor can learn.