> The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 linux x86_64 system, the types uint_fast16_t and uint_…
I think such a design decision is not simple to make. It depends on many factors, and for the moment I don't take your experiment as conclusive; see below.
First of all, there is no such thing as a single concept of what "fast" should mean. Here you emphasized in-place multiplication, which is just one particular point of view.
Then, x86_64 is an architecture, not a processor, so the outcome might be quite different for different processors in that family. I don't think it would be sane for gcc to make the type decision depend on particular command-line switches that optimize for a given processor.
Now, to come back to your example: I assume you have also looked at the assembler code? Did it, for example, use SSE instructions to realize your code? Did you switch on processor-specific options, something like -march=native?
Edit: I experimented a bit with your test program. If I leave it exactly as it is, I can basically reproduce your measurements. But after modifying it and playing around with it, I am even less convinced that it is conclusive.
For example, if I change the inner loop to also count downward, the assembler looks almost the same as before (but uses a decrement and a test against 0), yet execution takes about 50% longer. So I guess the timing depends very much on the environment of the instruction you want to benchmark: pipeline stalls and the like. You'd have to benchmark code of very different natures, where the instructions are issued in different contexts and alignment problems and vectorization come into play, to decide what the appropriate types for the fast typedefs are.