I'm trying to speed up a variable-bitwidth integer compression scheme and I'm interested in generating and executing assembly code on the fly. Currently a lot of time is spe…
Very good question, but the answer is not so simple... The final word will probably come from experiment, as is common in today's world of diverse architectures.
Anyway, what you want to do is not exactly self-modifying code. The "decode_x" procedures will already exist and will not be modified, so there should be no problems with the cache.
On the other hand, the memory for the generated code will probably be allocated dynamically from the heap, so its addresses will be far away from the program's own executable code. You can allocate a new block every time you need to generate a new call sequence.
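Something like this POSIX sketch could allocate such a block; I'm assuming `mmap` with `MAP_ANONYMOUS` is available and that the OS allows writable+executable mappings (on hardened systems you may have to map writable first and switch to executable with `mprotect` afterwards):

```c
/*
 * Minimal sketch (POSIX assumption): allocate a fresh buffer that the
 * generated call sequence can be written to and then executed from.
 * The buffer lives far from the program's own .text section, as noted above.
 */
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_exec_buffer(size_t size)
{
    void *p = mmap(NULL, size,
                   PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```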
How far is far enough? Not very far, actually. The distance probably only needs to be a multiple of the processor's cache line, and that is not big: something like 64 bytes for L1. With dynamically allocated memory you will be many pages away anyway.
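For illustration, here is a hypothetical emitter for such a call sequence on x86-64 (System V ABI). It deliberately uses absolute `mov rax, imm64; call rax` instead of a relative call, because the heap buffer may be gigabytes away from the program code and out of range of a rel32 displacement. The `decode_3`/`decode_7` names and their no-argument signatures are invented for the sketch; real decoders would also need argument setup per the calling convention:

```c
/* Hypothetical sketch: emit a straight-line sequence of absolute calls
 * to existing decode procedures into an executable buffer. */
#include <stdint.h>
#include <string.h>

extern void decode_3(void);   /* placeholders for the real decode_x procedures */
extern void decode_7(void);

typedef void (*decoder_fn)(void);

static uint8_t *emit_abs_call(uint8_t *p, void (*target)(void))
{
    uint64_t addr = (uint64_t)(uintptr_t)target;
    *p++ = 0x48; *p++ = 0xB8;               /* mov rax, imm64 */
    memcpy(p, &addr, 8); p += 8;
    *p++ = 0xFF; *p++ = 0xD0;               /* call rax       */
    return p;
}

/* buf must point into an executable buffer, e.g. from alloc_exec_buffer(). */
static decoder_fn generate_sequence(uint8_t *buf)
{
    uint8_t *p = buf;
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xEC; *p++ = 0x08;  /* sub rsp, 8: keep the
                                                             emitted call sites
                                                             16-byte aligned    */
    p = emit_abs_call(p, decode_7);
    p = emit_abs_call(p, decode_3);
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xC4; *p++ = 0x08;  /* add rsp, 8 */
    *p++ = 0xC3;                                          /* ret        */
    return (decoder_fn)(uintptr_t)buf;
}
```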
The main problem with this approach, IMO, is that the generated code will be executed only once. The program thus loses the main advantage of the cached memory model: efficient execution of looping code.
And in the end, the experiment does not look hard to carry out: just write a test program in both variants and measure the performance. If you publish the results, I will read them carefully. :)
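If you do try it, a rough skeleton like the one below could time the two variants. `decode_with_dispatch` and `decode_with_generated_code` are placeholder names to be replaced by the real implementations, and on older glibc you may need `-lrt` for `clock_gettime`:

```c
/* Rough benchmark skeleton for the suggested experiment. */
#include <stdio.h>
#include <time.h>

extern void decode_with_dispatch(void);        /* variant 1: switch/indirect calls    */
extern void decode_with_generated_code(void);  /* variant 2: generated call sequence  */

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    enum { REPS = 1000 };
    double t0, t1;

    t0 = seconds_now();
    for (int i = 0; i < REPS; i++)
        decode_with_dispatch();
    t1 = seconds_now();
    printf("dispatch:  %.3f s\n", t1 - t0);

    t0 = seconds_now();
    for (int i = 0; i < REPS; i++)
        decode_with_generated_code();
    t1 = seconds_now();
    printf("generated: %.3f s\n", t1 - t0);
    return 0;
}
```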