I have a simple function with an inner loop - it scales the input value, looks up an output value in a lookup table, and copies it to the destination. (ftol_ambient is a tri
How are you timing these routines? I wonder whether paging or caching is affecting the timings. It's possible that calling the first routine loads both into memory, crosses a page boundary, or pushes the stack into a page that isn't resident yet (causing a page-in), and that only the first routine pays the price.
You may want to run through both functions once before making the calls that take the measurements, to reduce the effects that virtual memory and caching might have.
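Something along these lines is what I mean by warming up first. This is only a rough sketch: `time_routine`, `routine_fn`, `demo_routine` and `demo_table` are names I made up for illustration, and `demo_routine` is a stand-in for your scale/lookup/store loop, not your actual code. It assumes the inputs are finite floats. It uses the standard `clock()`, which reports CPU time on most systems, so a higher-resolution wall-clock timer may suit you better.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the routines being compared. */
typedef void (*routine_fn)(const float *src, unsigned char *dst, size_t n);

/* Hypothetical lookup table; fill it before calling demo_routine. */
static unsigned char demo_table[256];

/* Stand-in routine: scale the input, clamp the index, look up, store. */
static void demo_routine(const float *src, unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int idx = (int)(src[i] * 255.0f);
        if (idx < 0) idx = 0;
        if (idx > 255) idx = 255;
        dst[i] = demo_table[idx];
    }
}

/* Returns the average CPU seconds per call over timed_runs calls.
 * The warm-up calls run first and are not timed, so one-off costs
 * (page-ins, cold caches) are paid before the clock starts.
 * timed_runs must be greater than zero. */
static double time_routine(routine_fn fn, const float *src,
                           unsigned char *dst, size_t n,
                           int warmup_runs, int timed_runs)
{
    for (int i = 0; i < warmup_runs; ++i)
        fn(src, dst, n);

    clock_t start = clock();
    for (int i = 0; i < timed_runs; ++i)
        fn(src, dst, n);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / timed_runs;
}
```

Call `time_routine` once per routine with the same buffers and the same `warmup_runs`, for example `time_routine(demo_routine, src, dst, n, 1, 1000)`, and see whether the gap between the two routines shrinks compared with your current numbers. If it does, the difference was probably a first-call cost and not the loop itself.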